SECTION I


INTRODUCTION  AND  SUMMARY 


The Multimode CPU Design Study was undertaken by the Air Force and Litton
Data Systems Division to define a multimode CPU architecture, to assess the microprocessor
design (IC process interdependencies intrinsic to the processor approaches), to establish a
detailed chip design (at the register level) for an advanced 8-bit, bit-sliced processor element,
and to assess the multimode chip design set and existing CPUs for their ability to perform
processing tasks.

The  problem  set,  defined  in  Section  II,  is  a representative  set  of  signal  processing 
tasks  required  through  a major  portion  of  the  1980’s.  The  set  was  used  as  the  benchmarks 
by  which  the  Multimode  CPU  (MMCPU)  design  could  be  bounded. 

The architectural design was attempted initially without the constraints of LSI technology.
The results are presented in Section III. The design process started from the bus
system (data and data addressing) and worked out to a CPU architecture. Because the FFT
represents the most difficult of the problems in the set, the impact of the multiplier/FFT
special function structure was investigated and two processor structures were presented (see
Figures 1 and 2). The various blocks of the processor were analyzed and two register arithmetic
logic unit (RALU) structures were defined (Figures 3 and 4). Each RALU is designed
to perform both the data processing (DP) and data addressing (DA) functions. An instruction
addressing and microcontrol structure was defined. Figure 5 is the instruction addresser
without its microinstruction memory.

Before the feasibility of the architecture as an LSI candidate could be tested, the
state of the art was assessed. The results are presented in Section IV. The gate level design
of the RALU structures is presented in Section V, which concludes that an 8-bit, bit-sliced
RALU for the DP/DA functions is feasible.

Three  microcomputer  architectures,  one  based  on  the  Tracor/RCA  GPU,  one  based 
on  the  Litton  DP/DA  RALU,  and  one  embodied  in  the  Raytheon  Micro  Signal  Processor, 
were  assessed  on  their  ability  to  perform  the  benchmarks  from  the  problem  set  discussed  in 
Section  II.  The  methodology,  initial  assumptions,  self-imposed  constraints,  and  results  and 
conclusions  are  presented  in  Section  VI. 

In  summary,  the  attempt  to  design  a single  large  scale  integrated  circuit,  the 
MMCPU,  has  revealed  some  interesting  insights  into  the  signal  processing  environment,  LSI 
technology,  and  processor  design.  Analysis  showed  that  the  main  functions  in  Section  III  are 
all  within  the  reach  of  current  LSI  technology,  but  two  chip  types  will  be  necessary  to 
accomplish  the  total  function  of  the  Multimode  CPU. 


Figure  2.  Processor 
SECTION II

GENERAL  SIGNAL  PROCESSING  PROBLEM 


2.0  INTRODUCTION 

The  objective  of  this  study  is  to  define  and  do  a top  level  design  of  LSI  circuitry  that  will 
have  significant  impact  upon  the  capabilities,  costs,  environmental  factors  and  performance  of 
future  military  systems  that  must  deal  with  the  information  content  of  analog  waveforms 
(signals)  to  perform  their  assigned  tasks.  The  systems  or  portions  of  systems  directly  addressed 
in  this  program  consist  of  those  techniques,  in  both  mathematics  and  implementation,  used  to 
transform  the  signal  information  content  into  a form  suitable  for  a known  user  (whether  the 
user  is  a human,  an  automated  tracking  system,  etc.,  depends  upon  the  actual  system  being 
implemented). 

The problem will be further bounded by assuming that the required signal processing will be
performed using digital circuitry and that recommended circuits should show the promise of
spanning a wide range of applications. These circuits are intended for use in the construction
of systems from 1980 through approximately 1990, most of which have not as yet been
conceived. An initial effort has, therefore, been expended in trying to predict the direction of
applications of digital signal processing to military problems for the next 10 to 15 years. It
was recognized that this program could do more than aid these applications but, if properly
executed, would speed and alter the course of new applications. Care was exercised during
the study of present and "drawing board" systems for use as a basis of predicting future
system requirements. The actual system goals were studied rather than specific implementations
that exist as approximations to desired systems. The desired systems often cannot be produced
cost effectively using today's circuits.

The field of applications to be addressed can be made clearer by first considering the nature
of signal processing itself. Signal processing problems are typified by:

a. Analog signal or signals to be processed (in this case digitally, thus requiring A/D
converters).

b. An uncooperative (noisy) environment which corrupts the desirable signals.

c. Low information-rate-to-data-rate ratio permitting averaging for signal-to-noise ratio
improvement.

Due to the uncooperative environment and low information-rate-to-data-rate ratio, the incoming
signal can be, and most often is, converted to what mathematicians call "sufficient statistics".
The idea is to transform the large amount of (often highly redundant) incoming data into a
relatively small amount of data which contains all or, in practice, almost all of the information
content of the initial signal. Once this transformation is performed, subsequent processing and
memory requirements simplify because less data must be handled at each processing step.

Signal processing tasks can be separated into high speed and low speed processing requirements
because of the sufficient statistic concept, where high speed and low speed are relative to the
input sampling rate for a specific problem and are not absolute. For example, the high speed
processing of a sonar problem may be slower than the low speed processing of a radar problem in
terms of actual hardware requirements. The dependence of processing speed on sampling rate
and the high and low speed processing requirements point out the fact that for many signal
processing jobs the total job could be done using a programmable CPU approach but, as
sampling rates increase, a point will be reached where high speed processing will have to be
outboarded and, for high rates, pipelined for maximum throughput. The low speed requirements
could almost always be performed in a programmable CPU, however.

Characterization of the signal processing problems in the 1980-1990 time frame can be
accomplished by considering the basic nature of such problems and the analytical techniques being
used to address the problems. The objective is always one of determining and suitably
formatting the information content of a signal which has been corrupted by an uncooperative
environment. Due to the large number of variables in this type of problem (number of samples,
system states, etc.) problems are cast in the form of matrix equations. This method of analysis
permits a certain ease of manipulation of large numbers of variables and/or equations. As a
direct result, the first statement of the problem solution is in the form of a matrix equation or,
more typically, the equations for computing sufficient statistics are matrix equations.

The implementation problem can then be viewed as a reduction of these matrix equations to
the point where they can be implemented by existing hardware. In the past this has required
reducing all equations to Boolean operations because design was done at the gate level. As
levels of integration increased, the ALU became a readily available part so that algorithms
needed only to be reduced to adds, subtracts and logical operations.

The very common operation of real multiplication has recently been attacked in an attempt to
reduce it to cost-effective hardware, and it seems reasonable to expect that the divide operation
will also become available as a hardware component. Thus, we see a common approach by
commercial semiconductor manufacturers to ease the manipulation of real scalar quantities in
numerical calculations.

In  order  to  increase  the  hardware/algorithm  boundary  further  so  that  less  effort  will  be  required 
to  implement  signal  processing  algorithms,  hardware  needs  to  address  the  operations  involved  in 
complex  vector  and  matrix  mathematics.  It  is  unlikely  that  in  the  near  future  single  chips  will 
perform  such  functions  as  a vector  multiply,  but  rather  a CPU  that  has  a multiplier  under  its 
control  could  be  organized  so  that  vector  operations  become  easy  to  program  and  are  efficiently 
implemented. 

This philosophy indicates that the direction of thought toward defining an ideal micro-signal
processing chip set be such that the chip set should:

a.  Provide  a hardware  complex  multiply. 

b.  Control  the  multiply  and  memory  so  that  matrix  manipulations  are  extremely  efficient. 

c.  Simplify  the  programmers’  task  for  performing  matrix  calculations. 

d.  Not  compromise  the  ease  of  performing  scalar  arithmetic  and  logical  operations. 

e.  Be  capable  of  handling  large  I/O  data  rates. 
2.1 MODELLING SIGNAL PROCESSING PROBLEMS


All possible signal processing problems cannot be considered during the design of a versatile
signal processor and a chip set. Instead, the problem must be modelled by a small, manageable
set of problems, representative of the baseline scenario from which general computational
requirements of all signal processing problems can be derived. With the concurrence
of the technical staff of the Processor Technology group (AFAL/DHE-1) of the Air Force
Avionics Laboratory, Litton Data Systems used the benchmarks, discussed in this chapter,
as a good representative set of problems for use in the design of a Multimode CPU chip set.
The following paragraphs briefly discuss each benchmark and the characteristics of the
signal processing indicated by the benchmark.

The first benchmark is a complex 1024 point FFT. In addition to being the most common
signal processing benchmark in the entire signal processing industry for comparing signal
processing equipment, this problem illustrates the first four facets of signal processing listed in
Table 1.

The popularity of the Fourier Transform over other transforms involving orthogonal basis
functions is by no means accidental. It is a direct result of the fact that the Fourier basis
functions (sine and cosine) are the eigenfunctions of all linear systems and, therefore, are the only
functions which will preserve their functional form (except for parameter changes) from input
to output of any linear system. For example, if A cos(ωt) is used as input to any linear
system, the response will be of the form B cos(ωt + θ) and no additional frequencies (basis
functions) will be produced (this cannot be claimed for Walsh or other orthogonal function
representations). Since all systems are modelled as linear whenever possible due to the
enormous gain in mathematical simplicity, it is assumed that Fourier Transforms will continue as
the most popular transform technique and, as implementation becomes less costly, their use will
grow considerably.

The basic operation performed during a Discrete Fourier Transform is the multiplication of a
vector times a matrix. If this operation could be solved quickly and efficiently by, say, a
matrix multiply chip there would no longer be any interest in the collection of algorithms
known generally as the Fast Fourier Transform. However, semiconductor technology will not
solve the matrix problem in the time frame under consideration so that taking advantage of the
cyclic properties of the Transform matrix (FFT) will continue as one of the most important
signal processing computational problems.
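
The vector-times-matrix view of the Discrete Fourier Transform can be illustrated with a short sketch. The code below is not taken from the study; it assumes the standard DFT kernel exp(-j2πkn/N), and the function names `dft_matrix` and `mat_vec` are chosen here purely for illustration. It makes visible why a direct evaluation costs N*N complex multiplies.

```python
import cmath


def dft_matrix(n):
    """Build the n-by-n transform matrix with entries exp(-j*2*pi*k*m/n)."""
    return [[cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n)]
            for k in range(n)]


def mat_vec(matrix, vec):
    """Multiply a matrix by a vector: one multiply-add per matrix entry."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def direct_multiply_count(n):
    """Complex multiplies needed by the direct matrix-vector product."""
    return n * n


# A unit impulse transforms to a flat spectrum of ones.
impulse = [1, 0, 0, 0, 0, 0, 0, 0]
spectrum = mat_vec(dft_matrix(8), impulse)
```

For a 1024 point transform the direct product needs `direct_multiply_count(1024)`, that is 1,048,576 complex multiplies, which is the cost the Fast Fourier Transform algorithms avoid.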

The second benchmark is a modification to the FFT by the application of a windowing
function to the data to reduce the side lobe effects inherent in the FFT algorithm. This operation,
if performed in the time domain, is an example of a high speed function product common in
modulation/demodulation processes and digital filtering via the FFT. If performed in the
frequency domain this algorithm is an example of a high speed convolution common in finite
impulse response digital filtering. In either case, this benchmark also illustrates the facets (1)
through (4) listed in Table 1.
Table 1. Important Facets of Signal Processing


Benchmark   Facets of Signal Processing

1 and 2      1.  High speed calculation using complex arithmetic on data arrays.
             2.  With the exception of numerical scaling, the high speed algorithm is
                 independent of the input data.
             3.  Array indexing is orderly although not always simple.
             4.  Multiply add is a common arithmetic pair of operations.

3            5.  Tight data dependent loops.
             6.  Double precision during less time-critical processing.
             7.  Numerical scaling.
             8.  Data dependent jumps.

4            9.  Sliding window data manipulation.
            10.  Averaging (integration).
            11.  Data dependent decisions.
            12.  Bit manipulation.

5           13.  High I/O rates between memory and the outside world.
            14.  High speed calculation on small input data blocks.
            15.  Repetitive use of a very short program.

6           16.  Fast/tandem address generation.
            17.  Data dependent address generation.
            18.  Efficient memory organization.
            19.  Data comparisons.


The third benchmark, coordinate conversion, involves the computation of several common
functions such as sine, cosine, divide, square root, etc. These functions are common to a wide
range of signal processing problems, particularly modulation, demodulation, detection, averaging
and standard deviation estimation. Demonstrating the ability to handle these functions will
indicate the ability to handle a variety of other functions such as logarithm, anti-logarithm,
error function, Marcum's Q function, etc., in a similar manner. The double precision
requirement is indicative of the fact that these functions are often required to great precision and
that, under the same circumstance, calculation speed is generally less critical so it would make
sense to use double precision programming rather than a machine with a larger word size. In
addition, calculation of these functions often involves iterative loops of a form that requires
results of a calculation before the next calculation can be performed. This problem illustrates
the signal processing characteristics numbers 5, 6, 7 and 8 as listed in Table 1.

The fourth benchmark is an example of Constant False Alarm Rate (CFAR) detection commonly
used to improve performance of radars, particularly when operating in a high clutter environment.
This benchmark illustrates facets 9, 10, 11 and 12 of Table 1.

The sliding window is an important aspect of signal processing wherein an algorithm is applied
to a set of data points (in this case averaging and a threshold based decision) and then a new
data point is introduced to the set while the oldest point is deleted and the algorithm is
repeated. Conceptually the same process is involved in finite impulse response (transversal)
filters, convolution, generation of algebraic codes, etc. In this benchmark the output is a series
of 0/1 decisions which, for memory and communication efficiency, should be packed into
computer words (e.g., 16 decisions in a 16 bit word).

The fifth benchmark is the use of a Cosine Transform as found in the front end processing of
an image bandwidth compression problem. The basic algorithm is an FFT and in that sense is
similar to benchmark 1. This problem differs, however, in that a high input data rate is
required while the transform itself is only 32 points. The problem is, therefore, one of
performing a short, fast operation including fast I/O, thus straining processor I/O, memory address
and store capabilities and calculating power. The facets of signal processing illustrated by this
benchmark are 13, 14 and 15.

The sixth benchmark is a pulse classification algorithm characteristic of general pattern
recognition problems but specifically oriented toward signal sorting for electronic warfare. To
accomplish this benchmark, the signal processor must have a highly flexible memory
organization with efficient data memory control, sophisticated data address generation for subsequent
data processing, data dependent data address generation and conditional jump/branch capability.

Because the specification of signal sorting algorithms is fairly incomplete in the literature, Litton
Data Systems was asked to help extend the specification for the Processor Technology group
(AFAL/DHE-1). Included in the succeeding sections is a detailed discussion of the signal
sorting problem and the general EW problem.

The use of a specific benchmark set has provided greater insight into the general properties
that a micro-signal processor chip set should possess. The benchmarks have been shown to
generally represent the totality of signal processing problems and it appears certain that a
machine architecture that can provide features 1 through 19 of Table 1 will be applicable
to real world signal processing problems.
2.2 BENCHMARKS

2.2.1 Fast Fourier and Weighted Fast Fourier Transform Benchmarks

The Discrete Fourier Transform Equation for an N point complex sequence f(n) is defined
as follows:

    F(k) = \sum_{n=0}^{N-1} f(n) W_N^{kn},    k = 0, 1, ..., N-1        (1)

where

    W_N = e^{-j2\pi/N}

A decimation-in-time Fast Fourier Transform can be readily derived from Equation (1) by
defining two (N/2) point sequences as the even and odd members of f(n). Because of the
highly cyclic nature of W_N, it can be shown that Equation (1) can be computed by first
computing

    F_1(k) = \sum_{n=0}^{N/2-1} f(2n) W_N^{2kn}        (2)

    F_2(k) = \sum_{n=0}^{N/2-1} f(2n+1) W_N^{2kn}      (3)

and then combining Equations (2) and (3) to achieve the result of Equation (1) as follows:

    F(k) = F_1(k) + W_N^k F_2(k)        (4)

This process can be continued with a significant computational advantage gained at each step.

The basic operation resulting from Equation (4) is called the Decimation-in-Time butterfly
defined as

    A' = A + W_N^k B
    B' = A - W_N^k B        (5)

This operation can be visualized in flow graph form as follows.

[Flow graph of the Decimation-in-Time butterfly]

The quantity W_N^k is commonly called a rotation vector or 'twiddle factor'. A combination
of butterflies arranged to produce a 32 point complex FFT is shown in Figure 6. This
structure results from the repetitive application of Equations (2), (3) and (4) and the
butterfly operation Equation (5). The same process can be used to define a 1024 point FFT and
it is this algorithm that has been chosen as the FFT benchmark. It should be noted that the
output values are unordered and that reordering is a necessary step to be included in the
benchmark problem.
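
The even/odd split and the butterfly of Equations (4) and (5) can be sketched in a few lines. The following is a minimal recursive illustration, not the benchmark implementation: it assumes the input length is a power of two, and the names `fft_dit` and `dft` are chosen here for illustration. Because it recurses on separate even and odd lists, its output emerges in natural order; an in-place arrangement such as the one in Figure 6 would still need the reordering step noted above.

```python
import cmath


def fft_dit(f):
    """Recursive decimation-in-time FFT; len(f) must be a power of two."""
    n = len(f)
    if n == 1:
        return list(f)
    f1 = fft_dit(f[0::2])  # transform of the even members
    f2 = fft_dit(f[1::2])  # transform of the odd members
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)  # twiddle factor W_N^k
        a, b = f1[k], f2[k]
        out[k] = a + w * b            # butterfly output A'
        out[k + n // 2] = a - w * b   # butterfly output B'
    return out


def dft(f):
    """Direct evaluation of the transform definition, used only as a check."""
    n = len(f)
    return [sum(f[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]


samples = [complex(i % 5, (-i) % 3) for i in range(32)]
fast = fft_dit(samples)
slow = dft(samples)
```

Comparing `fast` against `slow` term by term confirms that the repeated butterfly reproduces the direct transform to within rounding error.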

The algorithm will assume 14-bit input data which will be sufficient for almost all signal
processing problems. The nature of the algorithm, however, is such that numerical values tend
to increase through a butterfly operation. The largest data value out of a butterfly will be
at least as large but no greater than twice as large as the largest data value into the
butterfly. Overflows can be prevented by dividing by two at the output of every butterfly, but
this results in many unnecessary underflows. The scaling scheme employed involves keeping
all data values at 14 bits and, at the end of each group or course, checking to see if any
data value has overflowed into the 15th bit. Whenever this occurs all input data to the next
course will be divided by two. This scheme ensures no actual overflows while maintaining as
much precision as possible in the final results.
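
The per-course scaling check can be sketched as follows. This is an illustrative reading only: it assumes the 14 bits are a two's complement field including the sign, so that a value has "overflowed into the 15th bit" when it falls outside -8192..8191, and it assumes data are held as pairs of integers. The names `fits` and `rescale_course` are hypothetical.

```python
def fits(value, bits=14):
    """True if value is representable in a signed field of the given width."""
    limit = 1 << (bits - 1)
    return -limit <= value < limit


def rescale_course(data, bits=14):
    """Check one course of (real, imag) integer pairs for overflow.

    Returns the data to feed the next course and the number of halvings
    applied (0 or 1), so the caller can track the overall scale factor.
    """
    overflow = any(not fits(re, bits) or not fits(im, bits) for re, im in data)
    if overflow:
        # Arithmetic shift right by one divides every value by two.
        data = [(re >> 1, im >> 1) for re, im in data]
    return data, (1 if overflow else 0)


unchanged, shift_a = rescale_course([(8191, 0), (-8192, 5)])
halved, shift_b = rescale_course([(8192, -3), (10, 10)])
```

Summing the returned shift counts over all courses gives the power of two by which the final results must be interpreted.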

The  computational  requirements  can  then  be  defined  as  follows: 

a.  512  butterfly  operations  during  each  of  10  courses 

b. Fifteenth bit overflow check and scaling if necessary after each course

c. Reordering of final results for output.

A rate of 5 msec for a 1024 point FFT implies an average butterfly rate of less than 977 nsec,
but time must be allotted for processor dependent data loading, loop set-up and control and
output reordering. This time, thus, represents an absolute upper bound to the actual
butterfly time which will depend upon processor architecture.

In a similar manner, an absolute upper bound on butterfly time for a rate of 0.5 msec for a
1024 point FFT is 98 nsec, with this number decreasing as a function of processor architecture.
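
The butterfly time bounds follow from simple arithmetic on the counts given above (512 butterflies in each of 10 courses). A small sketch, with function names chosen here for illustration and the input length assumed to be a power of two:

```python
def butterfly_count(n):
    """Butterflies in a radix-2 FFT of n points: (n/2) per course, log2(n) courses."""
    courses = n.bit_length() - 1  # log2(n) for a power of two
    return (n // 2) * courses


def butterfly_time_bound_ns(n, transform_time_s):
    """Upper bound on time per butterfly, in nanoseconds, ignoring all overhead."""
    return transform_time_s / butterfly_count(n) * 1e9


total = butterfly_count(1024)
bound_5ms = butterfly_time_bound_ns(1024, 5e-3)
bound_half_ms = butterfly_time_bound_ns(1024, 0.5e-3)
```

Here `total` is 5120, and the two bounds evaluate to roughly 976.6 nsec and 97.7 nsec before any allowance for loading, loop control or reordering.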

The  additional  benchmark  of  windowing  the  FFT  data  will  be  performed  in  the  time  domain 
and  consists  of  a premultiplication  of  all  input  data  values  by  a window  function.  The 
technique  adopted  is  to  store  the  window  function  in  memory  thus  requiring  the  processor 
to  perform  1024  complex  multiplies  as  well  as  associated  fetches  and  stores.  This  technique 
permits  the  arbitrary  selection  of  any  window  function  at  no  additional  computational  cost 
since  the  function  will  be  prestored  in  memory. 
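
The prestored-window technique amounts to a table lookup followed by one complex multiply per input point. The sketch below assumes, purely for illustration, a Hann window stored as complex values with zero imaginary part; the study does not name a particular window, and `make_window_table` and `apply_window` are hypothetical names.

```python
import math


def make_window_table(n):
    """Precompute an n-point window table (a Hann window is assumed here)."""
    return [complex(0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)), 0.0)
            for i in range(n)]


def apply_window(data, table):
    """Premultiply each complex input value by its stored window value."""
    if len(data) != len(table):
        raise ValueError("data and window table must have the same length")
    return [d * w for d, w in zip(data, table)]


table = make_window_table(1024)
windowed = apply_window([1 + 1j] * 1024, table)
```

Swapping in a different window changes only the contents of `table`, which is why the choice of window carries no additional computational cost at run time.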
2.2.2 Coordinate Conversion Benchmark Definition


The coordinate conversion benchmark of polar-to-rectangular and rectangular-to-polar in single
and double precision demonstrates the fact that functions generated from input variables are
often required in signal processing. Normally they occur in post-processing applications where
relatively low speed calculations are required. Therefore, the use of double precision
programming is preferred when increased accuracy is required rather than implementing a larger
word-size machine.

The polar-to-rectangular conversion problem is defined as follows:

Given a point specified by magnitude R and angle θ, determine the rectangular
coordinates X and Y where

    X = R \cos\theta

and

    Y = R \sin\theta        (6)

The problem is basically one of computing sine and cosine functions and can be solved using
a nested polynomial approach to the Taylor series for either function over the range 0 to π/2
and, by symmetry, determining the functional value.

First assume that sin φ can be computed for 0 ≤ φ ≤ π/2. Then θ can be mapped to
φ as a function of the quadrant of θ and the function (sine or cosine) desired as shown in
Table 2. Therefore, any sine or cosine value can be computed using a Taylor series
expansion for sin φ in the range 0 ≤ φ ≤ π/2. The expansion is

    \sin\phi = \sum_{j=0}^{\infty} \frac{(-1)^j \phi^{2j+1}}{(2j+1)!}        (7)

The expansion must be limited to a finite number of terms, the number used reflecting the
desired numerical accuracy. Assume, for example, the first five terms are sufficient for a
particular application. Then

    \sin\phi \approx \phi - \frac{\phi^3}{3!} + \frac{\phi^5}{5!} - \frac{\phi^7}{7!} + \frac{\phi^9}{9!}        (8)
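
The five-term expansion can be evaluated as a nested polynomial so that each term costs one multiply and one add. The sketch below is illustrative only: the quadrant mapping it uses is the standard symmetry of the sine function and is an assumption standing in for the mapping table, and the names `sin_first_quadrant`, `sine`, `cosine` and `polar_to_rect` are hypothetical.

```python
import math

# Coefficients of phi * (c0 + c1*u + c2*u^2 + c3*u^3 + c4*u^4), with u = phi^2.
COEFFS = [1.0, -1.0 / 6, 1.0 / 120, -1.0 / 5040, 1.0 / 362880]
HALF_PI = math.pi / 2


def sin_first_quadrant(phi):
    """Five-term Taylor series for sin(phi), 0 <= phi <= pi/2, in nested form."""
    u = phi * phi
    acc = 0.0
    for c in reversed(COEFFS):
        acc = acc * u + c
    return phi * acc


def sine(theta):
    """Sine of any angle, reduced by symmetry to the first quadrant."""
    theta = math.fmod(theta, 2 * math.pi)
    if theta < 0:
        theta += 2 * math.pi
    quadrant = min(int(theta // HALF_PI), 3)
    r = theta - quadrant * HALF_PI
    if quadrant == 0:
        return sin_first_quadrant(r)
    if quadrant == 1:
        return sin_first_quadrant(HALF_PI - r)
    if quadrant == 2:
        return -sin_first_quadrant(r)
    return -sin_first_quadrant(HALF_PI - r)


def cosine(theta):
    """Cosine obtained from the sine routine by a quarter-turn shift."""
    return sine(theta + HALF_PI)


def polar_to_rect(r, theta):
    """Polar-to-rectangular conversion: X = R cos(theta), Y = R sin(theta)."""
    return r * cosine(theta), r * sine(theta)


x, y = polar_to_rect(2.0, math.pi / 3)
```

With five terms the truncation error at φ = π/2 is a few parts in a million, which indicates how the number of terms trades against the required precision.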
Table 2. Mapping of Sine and Cosine to One Quadrant of the Sine Function


There are three function problems here: square root, divide, and arc sine. A Taylor series
can be used for the arc sine and implemented exactly as Equation (11) above with a different
set of constants, k. The series is

    \sin^{-1}\theta = \theta + \frac{\theta^3}{6} + \frac{1 \cdot 3}{2 \cdot 4} \cdot \frac{\theta^5}{5} + \frac{1 \cdot 3 \cdot 5}{2 \cdot 4 \cdot 6} \cdot \frac{\theta^7}{7} + \cdots


It is possible to show, using the Contraction Mapping Theorem, that a divide can also be
implemented in the form of Equation (11). The resulting algorithm is as follows:

if Z = X/Y and -1 ≤ Y < -0.5, then the iteration

    Z_{n+1} = k_X + k_Y Z_n

will converge at a rate exceeding two bits per iteration to Z if

    k_X = -X(Y + 2)
    k_Y = (Y + 1)^2
    Z_0 = k_X


Note that k_X and k_Y remain constant for all iterations. The last required function to
complete the problem is the square root. The Contraction Mapping Theorem can again be
used to show that, if

    0.0625 ≤ N < 0.5625

then the recursion

    X_{n+1} = \epsilon_n + X_n        (16)

will converge to \sqrt{N} where

    \epsilon_n = N - X_n^2        (17)

The algorithms presented above represent one possible solution to the coordinate conversion
problem. In addition, they will demonstrate the goals of the benchmark in that they require
a flexible processor capable of handling real and double precision data.
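
The two contraction-mapping iterations can be sketched in floating point to show their convergence behaviour. This is an illustration under stated assumptions, not the benchmark code: the divide follows the iteration with constants -X(Y + 2) and (Y + 1)^2 for -1 ≤ Y < -0.5; the square root uses the reading X(n+1) = X(n) + (N - X(n)^2) of the recursion, and the starting value 0.5 is an assumption, since no starting value is stated. The function names and iteration counts are likewise chosen here.

```python
def divide(x, y, iterations=16):
    """Approximate x / y for -1 <= y < -0.5 by a fixed-point iteration."""
    if not (-1.0 <= y < -0.5):
        raise ValueError("divisor must satisfy -1 <= y < -0.5")
    k_x = -x * (y + 2.0)      # constant for all iterations
    k_y = (y + 1.0) ** 2      # below 0.25, so error shrinks over 2 bits/step
    z = k_x                   # starting value Z0
    for _ in range(iterations):
        z = k_x + k_y * z
    return z


def square_root(n, iterations=60, start=0.5):
    """Approximate sqrt(n) for 0.0625 <= n < 0.5625 by a fixed-point iteration."""
    if not (0.0625 <= n < 0.5625):
        raise ValueError("argument must satisfy 0.0625 <= n < 0.5625")
    x = start                 # assumed starting value
    for _ in range(iterations):
        x = x + (n - x * x)   # add the residual N - X^2 to the estimate
    return x


quotient = divide(0.3, -0.75)
root = square_root(0.36)
```

The fixed point of the divide iteration is k_X / (1 - k_Y), which simplifies to X/Y. The square root iteration contracts by about |1 - 2√N| per step near the solution, so it gains roughly one bit per iteration at the ends of the permitted range and converges faster near N = 0.25.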


2.2.3  Constant  False  Alarm  Rate  (CFAR)  Benchmark  Definition 


The  CFAR  benchmark  is  defined  by  assuming  that  6-bit  positive  values  are  input  from  the 
detection  stage  of  a radar  with  the  following  characteristics: 

a.  A 70-mile  range 

b.  A 200-nanosecond  compressed  pulse  width 

c.  A 1-millisecond  pulse  repetition  interval 

This implies that 4352 data points are to be processed resulting in 4096 binary decisions
by a CFAR algorithm using a 256-point sliding window.

The  algorithm  computes,  for  each  decision,  an  average  of  the  previous  128  points  and  the 
next  128  points  for  use  as  a decision  threshold.  This  threshold  is  modified  by  a constant 
threshold  parameter  and  then  compared  to  the  window  center  point  resulting  in  a binary 
decision  that  the  point  is  above  the  threshold  or  is  not.  The  algorithm  is  illustrated  in 
block  diagram  form  in  Figure  7. 

The output of the algorithm is, then, a single bit decision for each input value which would
be forwarded to another processor that would, probably, perform some scan-to-scan
operation. For the purpose of efficiently transferring the decisions, the signal processor is
required to pack 16 consecutive decisions into a 16-bit data word. This process will be
included in the CFAR benchmark.

The sliding window function of the benchmark can be efficiently implemented by computing
the average of the first 256 points and then updating the average, A_i, as follows:

    A_{i+1} = A_i + X_{i+128} - X_{i-128}        (18)

As a result, most of the points will require only three adds (one is part of the compare)
and one multiplication (the division by the number of points in the window can be combined
with the constant threshold parameter). Therefore, the CFAR algorithm will require

    3 * 4096 + 256 = 12544 adds/millisecond

and 

3 * 4096  = 12288  multiplies/millisecond 

in  addition  to  the  output  bit  packing. 
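
The running-sum update and the output bit packing can be sketched together. The code below is a minimal illustration under several assumptions: the window is taken as the 256 consecutive points implied by the update recursion (which therefore includes the center sample), the threshold parameter value 1.5 is arbitrary, decisions are packed most significant bit first, and the names `cfar_decisions` and `pack_bits` are hypothetical.

```python
def cfar_decisions(x, half=128, threshold=1.5):
    """Return one 0/1 decision per window position using a running sum."""
    window = 2 * half
    scale = threshold / window      # 1/window folded into the threshold constant
    total = sum(x[:window])         # sum of the first 256 points
    decisions = []
    for k in range(len(x) - window):
        i = half + k                # index of the window center point
        decisions.append(1 if x[i] > scale * total else 0)
        # Slide the window: add the newest point, drop the oldest.
        total += x[i + half] - x[i - half]
    return decisions


def pack_bits(decisions, word_bits=16):
    """Pack consecutive decisions into words, first decision in the top bit."""
    words = []
    for start in range(0, len(decisions), word_bits):
        word = 0
        for bit in decisions[start:start + word_bits]:
            word = (word << 1) | bit
        words.append(word)
    return words


samples = [10] * 4352
samples[2000] = 63                  # a single strong return in flat clutter
decisions = cfar_decisions(samples)
words = pack_bits(decisions)
```

With 4352 input points the routine yields 4096 decisions and 256 packed words, matching the point counts of the benchmark definition.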
2.2.4 Cosine Transform Benchmark


The cosine transform of an N point real sequence of pixel data is defined as

    G_k = 2 \sum_{n=0}^{N-1} x_n \cos\left[\frac{\pi k (n + 1/2)}{N}\right],    k = 0, 1, ..., N-1        (19)


Efficient computation of equation (19) can be performed by reworking the form of the
equation into one compatible with Fast Fourier Transform (FFT) techniques. This can be
done by rewriting equation (19) in the form of complex exponentials.
    k = 0, 1, ..., N-1        (22)


The summation term in equation (22) is recognized as the Fourier Transform of the even
sequence X_n and is, therefore, a real quantity for all values of k. Thus, equation (22) can be
written as

    G_k = \cos\left(\frac{\pi k}{2N}\right) \left[ \sum_{n=0}^{2N-1} X_n W_{2N}^{kn} \right],    k = 0, 1, ..., N-1        (23)


For the purposes of image bandwidth compression, the multiplication term cos(πk/2N) is of
no value since it is not data dependent and, therefore, contains no information of interest.
The term in brackets can be computed using a 2N length FFT. The cosine transform of X_n
(less the multiplicative cosine factor) will be the first N output points of the FFT algorithm.


A more efficient implementation can be determined by recalling that X_n has been made into an
even function and, as a result, the FFT output will be real. Consider forming a complex sequence
using X_n and an additional set of new data Y_n, also even extended, as the imaginary part. Thus

    X_p = X_n + j Y_n        (24)

The Fourier Transform of X_p is

    F(X_p) = F(X_n) + j F(Y_n)        (25)

because F is a linear operator. Now, since X_n and Y_n are even sequences, F(X_n) and
F(Y_n) will both be real functions. The implication of equation (25) is, therefore, that two
cosine transforms can be simultaneously computed by a single double length FFT and that
the results will be found in the first N real FFT outputs for X_n and the first N imaginary
FFT outputs for Y_n.
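
The two-for-one idea can be checked numerically with a short sketch. Several points here are assumptions rather than the study's method: each N point sequence is even extended by appending its mirror image, which places the axis of symmetry half a sample off the origin, so a phase factor exp(-jπk/2N) is applied to each output before the real and imaginary parts are separated; a direct transform stands in for the FFT to keep the example short; and all function names are hypothetical.

```python
import cmath
import math


def dft(seq):
    """Direct Fourier transform, standing in for a 2N point FFT."""
    n = len(seq)
    return [sum(seq[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]


def cosine_direct(x):
    """Cosine transform evaluated directly: 2 * sum x[n] cos(pi k (n + 1/2) / N)."""
    n = len(x)
    return [2 * sum(x[m] * math.cos(math.pi * k * (m + 0.5) / n) for m in range(n))
            for k in range(n)]


def two_cosine_transforms(x, y):
    """Cosine transforms of two real N point lists from one 2N point transform."""
    n = len(x)
    # Mirror-extend each list, then pair them as real and imaginary parts.
    extended = [complex(a, b) for a, b in zip(x + x[::-1], y + y[::-1])]
    spectrum = dft(extended)
    out_x, out_y = [], []
    for k in range(n):  # only the first N outputs are needed
        v = spectrum[k] * cmath.exp(-1j * cmath.pi * k / (2 * n))
        out_x.append(v.real)
        out_y.append(v.imag)
    return out_x, out_y


pixels_a = [float((7 * i) % 11) for i in range(16)]
pixels_b = [float((3 * i + 2) % 13) for i in range(16)]
got_a, got_b = two_cosine_transforms(pixels_a, pixels_b)
```

Comparing `got_a` and `got_b` with `cosine_direct` applied to each input confirms that one 32 point complex transform carries two 16 point cosine transforms.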

The computational portion of the Cosine Transform benchmark could be performed using the
technique described above on groups of two 16x1 input pixel vectors by employing a
32 point complex FFT. Input pixel data will be no more than 8 bits/pixel and, therefore,
there is no need for scaling considerations during a 32 point FFT performed on a 16 bit
machine.

The computational requirements of the Cosine Transform Benchmark can be further reduced
by considering the flow diagram of the 32 point FFT shown in Figure 8. Since only the
first 16 results are required and outputs 16 through 31 are to be discarded, it is worthwhile
to trace backward through the flow graph to determine at what point the discard can actually
take place. The output values not required are circled in the flow graph and the trace is
shown by circling all intermediate values not required due to the final discard process. It is
seen that half of the butterflies on all but the first course need not be performed at
all. The final algorithm requires only 16 half butterflies for a first course followed
by a 16 point complex FFT. This reduces the initial count of 80 total butterflies to a count
of 48 butterflies for this modified 32 point FFT.
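
The two butterfly counts can be reproduced with a small Python sketch. The function names are hypothetical, and the sketch assumes a radix-2 FFT and counts each half butterfly as one butterfly, which is what the total of 48 quoted above implies.

```python
def full_fft_butterflies(points):
    """Radix-2 FFT: points/2 butterflies in each of log2(points) courses."""
    courses = points.bit_length() - 1  # log2 for a power of two
    return (points // 2) * courses


def pruned_fft_butterflies(points):
    """First course of points/2 half butterflies, then a half-length FFT.

    Assumption: each half butterfly is counted as one butterfly.
    """
    return points // 2 + full_fft_butterflies(points // 2)


full_count = full_fft_butterflies(32)      # 16 butterflies x 5 courses
pruned_count = pruned_fft_butterflies(32)  # 16 half butterflies + 16 point FFT
```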

The processing speed for accomplishing the FFTs in real time is as follows:

    Number of pixels transformed         = 8064000 pixels/second

    Number of 32 point complex FFTs      = 8064000 / 32
                                         = 252000 FFTs/second

    48 butterflies per 32 point FFT      = 48 (252000)
                                         = 12096000 butterflies/second
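
The rate arithmetic above can be checked with a few lines of Python. The constant names are illustrative only; the figures are the ones quoted in this section.

```python
PIXELS_PER_SECOND = 8_064_000   # pixel rate quoted in the benchmark
PIXELS_PER_FFT = 32             # two 16x1 pixel vectors per 32 point complex FFT
BUTTERFLIES_PER_FFT = 48        # modified 32 point FFT

ffts_per_second = PIXELS_PER_SECOND // PIXELS_PER_FFT
butterflies_per_second = ffts_per_second * BUTTERFLIES_PER_FFT
nanoseconds_per_butterfly = 1e9 / butterflies_per_second

print(ffts_per_second, butterflies_per_second, nanoseconds_per_butterfly)
```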

Therefore, the average butterfly rate is one every 82.6 nanoseconds.

2.3 ELECTRONIC WARFARE PROBLEM

2.3.1  Basics  on  Radar  Characteristics 

In  general,  at  the  receiver  of  any  Electronic  Warfare  equipment,  the  received  signal  can  be 
represented  as: 

    $$ y(t) = S(t, a_1, a_2, \ldots, a_m) + n(t) $$        (26)

where S is the transmitted signal, which is a function of time and a number of parameters,
$a_1, a_2, \ldots, a_m$; and n(t) is noise. By using the theory of estimation of parameters, the m
parameters  of  the  transmitted  signal  give  considerable  information  about  the  specific  emitter 
responsible  for  the  received  signal.  The  frequency  spectrum,  pulse  characteristics,  pulse 
repetition  frequency,  beam  pattern,  scan  pattern  and  rate,  angle  of  arrival  and  antenna 
polarization  are  some  important  characteristics  that  may  be  utilized  in  the  classification  of 
an  unknown  emitter. 

The  frequency  spectrum  includes  the  center  frequency  and  modulation  of  a CW  type  radar, 
the  center  frequency  and  pulse  modulation  of  single  or  multiple  frequency  pulsed  radars  and 
agilities. Modulations range from the simple rectangular pulse with no FM to complicated
FM and coded waveforms. In many cases, the frequency domain is the only way to classify
the  more  complex  waveforms. 

Pulse characteristics include pulse rise time, fall time, amplitude, width, and jitter. These
parameters  are  in  the  time-domain  and  may  be  useful  in  quick  classification  of  simpler, 
pulsed  radars  along  with  center  frequency. 

The pulse repetition frequency or interval is primarily a derived parameter, obtained by
sorting a number of individual pulses. PRFs fall into four categories: monofrequency,
staggered, jittered, and random, of which the first three can be definitively classified and
tracked.

The beam pattern, scan pattern and scan rate are functions of the radar type, such as
tracking, surveillance, height-finding, etc. A given type of radar will generally exhibit given beam
and scan characteristics. These characteristics represent a transformation of frequency and
time domain information into a spatial picture of the radar beam and the beam's sweep
pattern; therefore, these characteristics are also derived.

Finally,  the  angle  of  arrival  is  the  angle  of  the  maximum  energy  for  the  transmitted  signal 
relative  to  the  direction  of  flight  of  the  aircraft.  The  angle  of  arrival  is  a single  pulse 
parameter  in  the  spatial  domain;  however,  to  be  most  accurate,  the  angle  of  arrival  should 
be  transformed  into  direction  of  arrival  to  compensate  for  the  in-flight  motion  of  the 
aircraft. 

2.3.1.1 Single Pulse Characteristics

Using  the  information  available  in  a single  pulse,  it  is  possible,  assuming  perfect  conditions, 
to  identify  the  emitter  type  and  location.  To  classify  the  general  type,  the  frequency 
spectrum  and  pulse  characteristics  are  the  only  useful  information  available  in  a single  pulse. 
A blanket statement may be made that the measurements of the parameters in equation (26)
will be spread in some random fashion and will represent a stochastic process. If the
various frequency and pulse characteristics are used to model emitter classes, then from the
theory of estimation, probabilities can be established using the estimated mean and variance
to identify a given incoming pulse with a modelled emitter class.

For  a simple  radar,  center  frequency,  pulse  width,  and  rise  and  fall  times  are  probably 
sufficient  to  make  an  estimate.  More  complex  types  may  need  details  of  the  frequency 
spectrum  of  the  pulse  or  the  agilities  to  identify  them. 

The location of an emitter can be obtained by converting the angle of arrival into
earth coordinates, often called the direction of arrival. If properly done, this parameter is
the least sensitive to pulse-to-pulse variations because the emitter cannot move significantly
from pulse to pulse. The frequency band and the angle of arrival are often used as a first
coarse sorting of an incoming pulse.

By  classifying  the  general  type  of  emitter  and  its  location,  the  specific  emitter  may  be 
identified  for  further  processing,  such  as  pulse  train  classification,  range,  lethality,  etc. 
Techniques  for  this  classification  will  be  discussed  in  Section  2.3.6. 

2.3.1.2 Multiple Pulse Characteristics

When  the  individual  pulses  are  successfully  classified  at  least  by  general  type,  additional 
information  can  be  extracted  about  the  emitters  responsible  for  these  pulses.  For  many 
classes  of  radar,  this  information  is  uninteresting  and  single  pulse  classification  is  enough  to 
display, counter, or disregard. However, the more lethal new threats have new agilities,
interesting scan patterns, and various processing techniques that require sophisticated
processing on the part of the Electronic Warfare receiver to characterize the emitter. Pulse train
characteristics, such as PRF, require a number of pulses to detect with any degree of
accuracy and may be modelled as a Markov process. Furthermore, scan pattern and beam
pattern can be determined using the pulse train information with pulse amplitude; however,
extensive processing is necessary. A discussion of pulse train classification is given in
Section 2.3.6.

2.3.2 Necessary EW Functions

There are three major types of electronic warfare functions: electronic reconnaissance,
electronic support measures (ESM), and electronic countermeasures (ECM). Electronic
reconnaissance is the specific reconnaissance directed toward the collection of electromagnetic
radiations, e.g. ELINT, COMINT, SIGINT, etc. Two functions are served by the
reconnaissance and analysis of the radiations: 1) intelligence gathering to obtain information for
the electronic order of battle, and 2) providing the basis of ECM designs or redesigns.

Electronic support measures are for monitoring the direction and type of potentially hostile
systems, generally using a priori reference data.

Electronic countermeasures deny or degrade the enemy's use of his electromagnetic
systems in order to obtain a tactical advantage; both active and passive measures exist.

The above major EW functions are supported by similar subsystems to the extent that the
major function needs them. All the major EW functions need to be able to receive signals,
determine the significant parameters of the signals, sort the signals, and associate the signals with
emitter classes and/or specific emitters. The ESM and ECM functions must be able to
prioritize emitters according to their lethality, and display the hostile emitters. Finally, the
ECM system must be able to process the emitters, analyze the available resources of the
electronic defensive systems, decide on the most effective countermeasures, and activate those
countermeasures. The bulk of this section will be devoted to the sub-functions necessary to
sort signals, associate signals with classes, and determine lethality, primarily for the ESM/ECM
mission.

2.3.3 Channelized Front End

No specific system is defined for the ensuing benchmarks; however, the availability of signal
parameters  is  assumed.  The  most  promising  system  in  development  today  contains  a 
channelized  front-end  receiver,  and  the  processing  is  done  on  a channel  or  subchannel  basis. 

Typical  first  problems  are  dense  signal  environments  and  radar  characteristics  that  cover 
multibeam, multifrequency transmission, PRF agility, RF agility, CW, and intrapulse
frequency agility. As a minimum, a receiver must possess:

1.  An  ability  to  handle  multiple  frequencies  simultaneously 

2.  A near-unity  probability  of  detection 

3.  Good  frequency  measurement,  resolution,  and  accuracy 

4.  Single-pulse  acquisition  and  parameter  measurement. 

Handling high pulse densities spread over a wide frequency range requires a wide
instantaneous bandwidth. Furthermore, a wide bandwidth allows instant, single pulse acquisition.

The complex PRF agile radars require sorting by single pulse parameters, which demands
good frequency measurement, resolution, and accuracy and a high probability of intercept.
The channelized receiver concept has a wide instantaneous bandwidth and high signal
sensitivity, allowing high probability of detection over several octaves. An excellent
discussion of channelized receivers and the impact of surface acoustic wave devices may be found
in Electronic Warfare, September/October 1977.

2.3.4  System  Architecture 

From a generic point of view, a fantastic system can be postulated that will process any
incoming pulse, fully characterize it, classify it, correlate it with its train of pulses, direct
countermeasures, and do whatever else is necessary. From a practical point of view, this system
must handle a multitude of pulses with exceedingly different characteristics. This section
will attempt to define a system flow and point out the strengths and weaknesses of the
various steps in the flow. To attempt this definition, first, the top-level system flow of a
“pulse” processor will be discussed; second, the processing flow will be analyzed; and
last, the architectural necessities will be presented.

2.3.4.1 System Flow

Incoming Signals: The incoming pulse is received and generally converted to baseband. The
various pulse parameters, such as center frequency, pulse width, rise time, fall time, AOA,
etc., are extracted from the pulse, and digital words are passed to the signal “sorter”. The
digital words, more or less, characterize the received pulse.

Sorting of Signals: As the digital words enter the signal processor (“sorter”), the processor
must attempt to classify the generic type of radar so that the proper countermeasures may
be chosen, the potential danger may be ascertained, and further processing may be
simplified. Furthermore, the processor should identify the specific emitter so that any
directivity of countermeasures may be specified and the probability of a radar mode
change to a dangerous mode may be estimated. If the generic type and/or the specific
emitter can be classified, then multipulse statistics can be gathered to refine further the
processor’s knowledge of how to jam and when this threat will become dangerous. Finally,
based on available data, the processor must prioritize the threats for display to an operator
or for automatic deployment of countermeasures. In the airborne ECM case, this
prioritization is critical because the ECM gear has definitely limited quantities of jammer power or
deployable passive countermeasures to use.

2.3.4.2 Processing Flow

Signal  Parameters:  As  the  incoming  pulse  is  being  received,  a number  of  operations  begin 
that  extract  information  about  the  pulse.  The  minimal  set  usually  includes  center  frequency 
(fc), pulse width (PW), angle of arrival (AOA), and time of arrival (TOA). These require
very  little  processing  to  extract  the  information  and  may  be  handled  primarily  in  analog 
form  while  being  processed,  and  then  converted  to  digital  signals. 

However, the more exotic the emitter class, the more sophisticated the processing must
become. A considerable amount of preprocessing may be necessary if a good
characterization of complex radars is desired. Radars that have modulated waveforms, train and/or
coded pulses, spread spectrum characteristics, or CW may require greatly enhanced
preprocessing on a single-pulse basis to be successfully characterized. Table 3 indicates some of the
variety that can be seen in radar waveforms. Each type has favorable properties that are
useful in relation to the range-velocity ambiguity function.

To characterize these pulses successfully, additional frequency domain information in the
form of spectral analysis, or additional time domain information such as rise and fall time
and pulse amplitude, may be necessary for emitter-type classification; a spatial
domain transformation from AOA to direction of arrival (DOA) may also be
necessary for specific-emitter classification. A significant problem or conflict arises because
the volume of incoming pulses is high; therefore, the pre-processing rates for spectral
information will be exceedingly high, approaching 200-500 million operations per second for a
00-100 MHz channel at baseband. Unfortunately, the spectral analysis must be done before
any emitter-type classification can be performed. If the easily classified pulses could be
stripped away either from the raw analog or in the digital data, the spectral analysis could
be done on the spread spectrum pulses at a much lower processing rate. Current equipment
can only perform limited amounts of this preprocessing because of hardware limitations.

Table 3. Classification of Radar Waveforms

    SINGLE PULSE
        Rectangular, No FM
        Spread Spectrum
        Rectangular - Linear FM
        Linear V FM
        Stepped FM
        Quadratic FM
        Gaussian, Linear FM

    TRAIN PULSE
        Equally Spaced, Identical
        Equally Spaced, Identical With Constant Complex Multiplier
        Non-Identical
        Multiple Frequency
        Staggered PRF
        Multiple Carrier
        Pseudo-Random Coded
        Barker
        Maximum-Length Sequence
        Polyphase Sequence (Ternary, Quaternary)
        Huffman

    CW
        Simple
        Frequency-Modulated
        Multi-Frequency

Probabilistic Type Classification: The signal parameters are passed to the signal (“sorter”)
processor for classification and action (display, countermeasures, etc.). The parameters
will be a best approximation of the parameter set actually needed to characterize the
received signal. Recalling equation (26), the signal processor will receive $\hat{y}(t, b_1, \ldots, b_n)$, which
will approximate y, that is

    $$ \hat{y}(t, b_1, \ldots, b_n) \approx y(t, a_1, \ldots, a_m) $$        (27)

where $\hat{y}$ is the signal processor’s representation of y and $b_1, b_2, \ldots, b_n$ are the signal
representation parameters. A given emitter-type will have a “characteristic” set of representation
parameters, which will be spread in a random fashion about some mean.

Using probabilistic techniques for multivariate data analysis, a Chi-squared method may be
employed. The proximity of a given pulse to the “characteristic” set of parameters for a
given type will be a measure of the probability of identification. If the measure exceeds
a given threshold, classification will be declared, and the probability of this type will be
updated. A further discussion is given in Section 2.3.6.

The threshold criterion is included to ensure that “over-classifying” does not occur. A pulse
can arrive at the receiver that does not fit into any a priori emitter-type class. The
processor must first be able to create a new class if a sufficient statistic can be formed.
Furthermore, the threshold may be warped to allow an easier classification of emitter-types
that have exceedingly wide diversity or a high lethality. By establishing a low threshold on
the more lethal threats, the receiver processor will declare more false classifications of the
lethal emitter-types; however, this higher error rate may be justified as a safety feature (the
ounce of prevention rather than the pound of cure).

Specific Emitter Classification: A specific emitter can be identified primarily from type
classification and AOA/DOA. In the state-of-the-art case, today’s receiver processors use
frequency/frequency band and AOA. This classification can also be done as a Chi-squared
distance measure; however, the statistic for the AOA/DOA will have to be built up as the
emitter-type probability is created. To minimize the processing, DOA is preferred because
it relates the emitter to a fixed geometric reference which is insensitive to the motion of
the platform. The establishment of a statistical parameter will be discussed in
Section 2.3.6.

Furthermore, the probability of a specific emitter can be updated, and a current
“active emitter” list may be established and updated. A given specific emitter may be
modelled as a Markov process; that is, if an emitter has been detected, the likelihood is
high that it will be observed again. A scan-to-scan correlation may be observed. The theory
of Markov chains may be applied to describe the scan-to-scan correlation. In the theory of
Markov processes, the outcome of any particular event is not assumed to be independent of
other events. This theory has been applied to radar returns at the radar antenna and can be
likewise applied to the EW receiver.
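
A minimal two-state Markov chain gives a feel for the scan-to-scan correlation described above. This is a hypothetical sketch, not the report's model: the function name and the transition probabilities (0.9 and 0.2) are illustrative assumptions, chosen only so that an emitter seen on one scan is likely to be seen on the next.

```python
def active_probability(p_stay_active, p_become_active, scans, start_active=True):
    """Probability that an emitter is observed after `scans` scan intervals.

    Two states (observed / not observed); the probability for the next scan
    depends only on the state at the current scan.
    """
    p = 1.0 if start_active else 0.0
    for _ in range(scans):
        p = p * p_stay_active + (1.0 - p) * p_become_active
    return p


# Assumed, illustrative transition probabilities.
after_one_scan = active_probability(0.9, 0.2, scans=1)
after_two_scans = active_probability(0.9, 0.2, scans=2)
long_run = active_probability(0.9, 0.2, scans=200)
```

With these assumed values the probability starts high for an emitter that has just been detected and settles toward the chain's steady state, `p_become_active / (1 - p_stay_active + p_become_active)`.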

Using the above rationale, multipulse statistics and properties can be ascertained. Pulse train
characteristics, scan pattern and scan rate, and beam width can be determined. These
characteristics require sophisticated processing whose difficulty increases as a power (N^2 or N^3)
of the number of items being processed.

Further statistics may be gathered if interchannel communication is permitted; that is, if a
number of parallel processors are assumed to be handling portions of the spectrum,
multifrequency threats or spread spectrum threats may appear in a number of the channels. The
correlation of interchannel information for specific classes of emitters will provide positive
identification.

2.3.5  Architectural  Necessities 

The processing flow for an EW receiver requires a diverse set of processing capabilities which
will be outlined herein and expanded in the coding and discussion in Section 2.3.6. The
most severe are primarily in the address generation and data comparison areas.

The memory organization must be highly flexible, such that if a given frequency band has a
high degree of activity, the memory space can be reallocated to accommodate the higher
activity. Two functions or architectural necessities are indicated: 1) efficient data memory
control and organization and 2) sophisticated data address generation for subsequent data
processing. These necessities, when combined with the high pulse densities of the current
and future EW environment, dictate a dedicated data address function to control the reading
and writing of data into memory as well as to control the allocation of memory based on
need.

Data comparison has implications in all areas of the processor: data addressing, data
processing  and  control.  Even  though  a very  sophisticated  algorithm  can  be  devised  to 
improve  the  classification  processes,  a final  decision  must  be  made  by  the  processor.  This 
decision  can  be  done  in  all  hardware  or  in  hardware-software,  but  ultimately  it  reduces  to  a 
simple  yes/no  comparison. 

From  the  outcome  of  the  comparison,  three  types  of  branches/jumps  may  be  necessary  in 
this  process:  1)  Data  dependent  data  address  generation,  2)  Data  dependent  arithmetic 
decision, or 3) Conditional or unconditional jumps in program memory. Because each type
of decision or branch is best handled in a different portion of the processor, the control
for these branches should be distributed among those portions rather than totally centralized.

2.3.6  Benchmarks 

This  section  will  contain  a discussion  on  three  algorithms  and  benchmarks  necessary  to 
accomplish  the  algorithms. 

Pulse Classification, Mean and Variance Determination, and PRF Sorting algorithms will be
developed primarily from the representation of the received signal given in equations (26) and
(27). Certain properties of random variables and stochastic processes will be included only
if they are necessary for completeness. The benchmarks will be developed for the main sections
of the algorithms. Processor setup steps will be included only if
requisite for understanding.

2.3.6.1 Representation of the Received Signal

The  signal  processor  receives  a set  of  parameters  from  the  receiver  that  are  the  receiver's 
best  effort  to  characterize  the  incoming  pulse.  The  parameters  deviate  from  the  exact  set  of 
parameters  and  parametric  values  because  the  receiver  has  finite  capabilities  to  detect  the 
pulse,  may  not  detect  all  the  “proper”  parameters,  and  receives  a noisy,  corrupted  signal 
which it further corrupts. Using equation (26), the received signal, R(t), is represented by an
N + 1 dimensional vector word, which is transformed by the function $\hat{y}$ to approximate
R(t), that is,

    $$ R(t) \approx \hat{y}(t, b_1, \ldots, b_n) $$        (28)

The N + 1 dimensional vector word is considered a pattern vector and the parameters
$b_1, \ldots, b_n$ are random variables. Assuming $\hat{y}(t, b_1, \ldots, b_n)$ satisfies the general conditions for a
random variable, then R(t) is also a random variable.

NOTE: In the signal processor, the $b_i$'s will be represented by digital words,
which means that all probabilities discussed henceforth will be
discrete probabilities. Furthermore, it is assumed that the $b_i$'s are
independent and orthogonal, that is, uncorrelated. Although this
is not strictly true, a set of $b_i$'s can be chosen where the $b_i$'s approach
being uncorrelated.

2.3.6.2 Mean and Variance Determination

Algorithm: Assuming a complete statistical description of the noise at the receiver is known, the
joint probability density function for the noise can be used. The pattern vector words can be
represented as:

    $$ b_i = S_i + N_i $$        (29)

where $S_i$ is the ith parameter representing the transmitted pulse and $N_i$ is the noise on the ith
parameter. Assuming the noise can be processed out or statistically removed, the job is to form
estimates of the $b_i$'s on the basis of M observations, $y_1, \ldots, y_M$, of a given emitter-type.

Two helpful parameters in forming estimates for the $b_i$'s are the sample mean and variance, which
will be used later in the pulse classification algorithm.


The “sample mean” is simply the sum of the measurements divided by the number of observations.
In terms of the b’s:

    $$ \text{Sample mean of } b_i = \bar{b}_i = \frac{b_{i1} + b_{i2} + \cdots + b_{iM}}{M} $$        (30)

The sample mean, $\bar{b}_i$, is an estimate of the expected value of $b_i$. The “sample variance” is the
sum of the squared deviations of the individual observations from the expected value divided by the
number of observations. In terms of the b’s:

    $$ \text{Sample variance of } b_i = \sigma_i^2 = \frac{(b_{i1} - \bar{b}_i)^2 + \cdots + (b_{iM} - \bar{b}_i)^2}{M} $$        (31)

The sample variance, $\sigma_i^2$, is the second central moment or the dispersion of $b_i$.

Benchmark: The mean and variance for the various parameters of the known emitter-types
will be a priori data for the signal processor. These values will be the distillation of
intelligence data. Pulses not meeting the threshold criteria on AOA/DOA statistics will
have to build up a sample mean and variance for AOA/DOA to permit easy classification. If
a processor is able to add new categories or emitter-types, then the sample mean and
variance algorithm or an equivalent will have to be included as an auxiliary processing task
for all the signal parameters.

Equation (30) may be implemented directly in an iterative fashion such that 2 operations are
necessary per iteration; however, equation (31) would require 2N-1 additions, N multiplications
and a division or a 1/M multiplication. To build up a statistic, it is often necessary to
integrate a new $b_i$ as the data comes.


A straight implementation would require 3N operations for every new data point. Assuming
the number of observations starts at 1 and goes to N, a total of 3N(N+1)/2 operations would be
required for N observations. A different approach was explored, and it can be shown that equation (31)
can be reduced with the aid of equation (30) to

    $$ \sigma_i^2 = \frac{1}{N} \sum_{j=1}^{N} b_{ij}^2 - \bar{b}_i^2 $$        (32)

By  this  reduction,  equation  (32)  can  be  used  in  an  iterative  procedure  adding  only  4 
operations  per  iteration. 
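
The iterative use of equations (30) and (32) can be sketched in Python as follows. The function name and the test data are illustrative assumptions; the two running sums play the role of the accumulators that an iterative implementation would keep between observations.

```python
def running_mean_variance(observations):
    """Yield (mean, variance) after each new observation of one parameter b_i.

    Uses equation (30) for the mean and the reduced form (32) for the
    variance: mean of the squares minus the square of the mean.
    """
    total = 0.0      # running sum of b_ij
    total_sq = 0.0   # running sum of b_ij squared
    results = []
    for n, b in enumerate(observations, start=1):
        total += b
        total_sq += b * b
        mean = total / n
        variance = total_sq / n - mean * mean
        results.append((mean, variance))
    return results


history = running_mean_variance([2, 4, 4, 4, 5, 5, 7, 9])
final_mean, final_variance = history[-1]
```

One design caveat: in finite word-length arithmetic, subtracting two nearly equal quantities in equation (32) can lose precision when the mean is large compared with the spread, so the accumulator width matters.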

Below is a sample processing implementation of the sample mean and sample variance
described by equations (30) and (32), respectively. Four values must be stored to set up the
iteration loop, and a loop counter test and increment must be included for each iteration.
Therefore, for N observations, 8N + 4 operations must be performed.


SAMPLE PROGRAM FOR MEAN AND VARIANCE

    SETUP   Enter C = 0, D = 0, N = 1, NMAX = NMAX
    MEAN    C = C + BI(N)
            BIBAR = C/N                 Comment: Sample mean, \bar{b}_i
            BIB2 = BIBAR*BIBAR
            D = D + BI(N)*BI(N)         Comment: Running sum of squares
            SIGMA2 = D/N
            SIGMA2 = SIGMA2 - BIB2      Comment: Sample variance, \sigma_i^2
            IF N = NMAX, THEN STOP
            N = N + 1, JUMP MEAN
    STOP


2.3.6.3 Pulse Classification Algorithm


Algorithm: Assuming a sample mean and variance for each parameter $b_i$ has been determined
for J classes, the mean and variance may be used as a measure of the class, $C_N$, against a
new incoming vector word. A new data word y is received with parameters $b_1, \ldots, b_n$. To
determine the probability that y belongs to class $C_N$, a mean-square error between each new
$b_i$ and the mean $\bar{b}_{Ni}$ of class $C_N$ is obtained and normalized with the variance
previously established for $C_N$. This mean-square error is often referred to as a "distance
measure". The error is given by:

    $$ e_{Ni}^2 = \frac{(b_i - \bar{b}_{Ni})^2}{\sigma_{Ni}^2} $$        (33)

(Note: $\bar{b}_{Ni}$ represents the sample mean of the ith parameter of class $C_N$. Compare with
$b_{ij}$, which means the jth observation of the ith parameter of an unspecified class.) For the
entire data word y, equation (33) becomes

    $$ e_N^2 = \sum_{i=1}^{n} \frac{(b_i - \bar{b}_{Ni})^2}{\sigma_{Ni}^2} $$

This procedure assumes a multivariate normal distribution for the vector variable in each of
the classes. We use the notion of a swarm for the plot in measurement space of points
representing the members of a single class. A multivariate normal swarm is very dense in
the region of the class centroid and thins out in all directions. The normal swarm is a
hyper-ellipsoidal distribution. The probability density function for the ith parameter, $b_i$,
belonging to the Nth class is:

    $$ P_{Ni} = \frac{1}{\sigma_{Ni} \sqrt{2\pi}} e^{-\chi_N^2 / 2} $$        (34)

where $\chi_N^2$ represents a Chi-squared statistic and equals $(b_i - \bar{b}_{Ni})^2 / \sigma_{Ni}^2$. Dropping the constant
and extending equation (34), the probability density function of y belonging to $C_N$ is

    $$ P_N = \exp(-e_N^2 / 2) $$        (35)

which also follows a Chi-squared statistic. The probability of class $C_N$ is now represented
by $P_N$; the probability of all J classes must be similarly determined. Finally, to yield the
relative density of class $C_N$ with respect to y versus the other J densities, the relative
densities are determined by:

    $$ Pr_N = \frac{P_N}{\sum_{j=1}^{J} P_j} $$        (36)

Whichever $Pr_N$ is the largest is the class. However, a threshold function may be invoked at
this point which will skew the determination. A class may be favored because it represents
a larger threat or for whatever reason, or a “No class” determination may be made. A
“No class” determination is valid only in a processor which can deal with unknown signals.
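
The classification steps above (distance measure, equation (35), equation (36), then a threshold) can be sketched in Python. Everything named here is an illustrative assumption: the `classify` function, the class names, and the parameter values (center frequency in MHz, pulse width in microseconds) are hypothetical, and applying the threshold to the unnormalized density $P_N$ is one possible reading of the "No class" rule, not the report's stated method.

```python
import math


def classify(pulse, classes, threshold=0.0):
    """Return (class name or None, relative densities) for one data word.

    classes maps a name to (means, variances), one entry per parameter.
    """
    densities = {}
    for name, (means, variances) in classes.items():
        # Distance measure: sum of squared errors normalized by variance.
        e2 = sum((b - m) ** 2 / v for b, m, v in zip(pulse, means, variances))
        densities[name] = math.exp(-e2 / 2.0)          # equation (35)
    total = sum(densities.values())
    if total == 0.0:
        return None, {name: 0.0 for name in densities}  # far from every class
    relative = {name: p / total for name, p in densities.items()}  # equation (36)
    best = max(relative, key=relative.get)
    if densities[best] < threshold:                     # assumed "No class" rule
        return None, relative
    return best, relative


# Hypothetical emitter classes: (means, variances) per parameter.
emitter_classes = {
    "search": ((3000.0, 1.0), (100.0, 0.01)),
    "track": ((9000.0, 0.5), (400.0, 0.0025)),
}
label, relative = classify((3005.0, 1.05), emitter_classes, threshold=0.1)
unknown, _ = classify((6000.0, 5.0), emitter_classes, threshold=0.1)
```

A lower threshold for a particular class would favor that class, which is how a more lethal emitter-type could be given the benefit of the doubt.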

Benchmark: Unfortunately, no simple reduction is available for this algorithm. The processor
must compute the mean-square error of each new y versus every class, $C_N$. The mean-square
error for each ith parameter requires three memory fetches to get $\bar{b}_{Ni}$,
$1/\sigma_{Ni}^2$, and $b_i$, and 4 computational steps; a total of 7 operations are involved.
With J classes and m parameters, a total of 7*J*m operations are necessary to make the
computations, plus indexing the loop counter, which involves another J*m operations.


The processor must calculate the probability that y belongs to every class, $C_N$, and
normalize the probabilities. Assuming the exponential is a look-up function, 1 memory
fetch and 2 computational steps are required for every class, plus one memory fetch per y
for the $\sum_{j=1}^{J} P_j$; a total of 3J + 1 operations are required.

Lastly, the processor must determine which class has the highest probability as well as
which classes have passed their threshold. A minimum of one memory fetch and two comparisons
per class as well as one memory store per y are required; a minimum total of
3J + 1 operations are necessary. A maximum of J-1 additional memory stores is possible;
therefore, a maximum total of 4J operations may be required. Below is a sample program.

SAMPLE PROGRAM FOR PULSE CLASSIFICATION

SETUP    EJ = 0, IMAX = IMAX, JMAX = JMAX,
         PYK = 0, I = 0, J = 0

ERROR    A = B(I) - BBAR(J,I)          Comment: a = (b_i - bbar_ji)
         C = A*A                       Comment: c = (b_i - bbar_ji)^2
         D = C*SIGMA(J,I)              Comment: d = (b_i - bbar_ji)^2 / sigma_ji^2
         EJ = EJ + D
         IF I = IMAX, JUMP PROB
         I = I + 1, JUMP ERROR

PROB     F = -EJ/2 (Shift Right)
         PJ = MEM(F)                   Comment: Memory look-up for exponential: exp(F)
         PYJ = PJ*SUMPJ
         TEST = PYJ - PYK              Comment: Compare with previous "high" probability
         IF TEST < 0, JUMP JCOUNT
         IF TEST = 0, JUMP JSTORE
         PYK = PYJ                     Comment: Replace with new "high" probability

JSTORE   MEM = J                       Comment: Store C_j

JCOUNT   IF J = JMAX, JUMP THRES
         J = J + 1, JUMP ERROR

THRES    TEST = PYK - T                Comment: Compare with threshold
         IF TEST > 0, JUMP STORE

STORE    At this point, the program has the option to display the output (MEM), to
         determine a countermeasure, or to do nothing if the test is passed.
         Otherwise, it will store the y for further processing.
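
The operation counts quoted in the pulse classification benchmark can be tallied with a short helper. This is a sketch for checking the arithmetic only; the function name `classification_op_counts` and its argument names are assumptions, and the counts themselves are taken directly from the benchmark text (7*J*m, J*m, 3J + 1, and 3J + 1 to 4J).

```python
def classification_op_counts(num_classes, num_params):
    """Return (minimum, maximum) operations to classify one data word y."""
    j, m = num_classes, num_params
    error_ops = 7 * j * m     # 3 fetches + 4 computational steps per parameter, per class
    loop_ops = j * m          # indexing the loop counter
    prob_ops = 3 * j + 1      # exponential look-up and normalization
    decide_min = 3 * j + 1    # fetch + two compares per class, one store per y
    decide_max = 4 * j        # up to J-1 additional memory stores
    base = error_ops + loop_ops + prob_ops
    return base + decide_min, base + decide_max

# Example: 4 classes described by 3 parameters each.
low, high = classification_op_counts(4, 3)
```

The error computation dominates: it grows with the product J*m, while the remaining terms grow only with J.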


2.3.6.4 PRF Sorting

Algorithm: As the incoming signal is received, a time-of-arrival (TOA) number or word is
associated with it. This TOA word is relative to an internal clock and demarks the beginning
of the pulse. The primary purpose for the demarcation of the pulse is to develop multipulse
statistics like the PRF/PRI of a specific emitter. By knowing the PRI and the time of arrival
of the previous pulse, an ECM processor can anticipate its needs for expenditure of
countermeasure resources.

The major problem for PRF sorting is multiple emitters of the same or similar type
transmitting in close geometric proximity such that the AOAs cannot be resolved. The goal of
the sorter is to pull apart the distinctive PRFs, either simple, staggered or jittered. The
algorithmic flow is exceedingly simple; however, as the number of pulses to be sorted
increases, the problem can become untenable.

The flow is as follows:

1. Calculate the difference between all reasonable TOA combinations, that is,

   $\Delta_{ij} = TOA_i - TOA_j$ for all $j \neq i$

   where $TOA_i$ represents the TOA of the ith pulse.

2. Compare the differences for a repetitive pattern, such as:

   $\Delta_{ij} = \Delta_{jk} = \Delta_{kl} = \Delta_{lm} = \ldots$

   A tolerance must be included in this comparison so that the comparison is not
   overly sensitive to noise.

3. After a successful comparison of a given $\Delta_{ij}$, a PRF can be declared and utilized.
   Utilization may range from simply preparing the countermeasure to developing
   histograms for beam width and scan pattern determination.
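
The three steps above can be illustrated with a brute-force sketch. It is not the report's benchmark program: the function name `find_pri`, the `min_hits` parameter, and the choice to follow a chain of pulses from each candidate difference are assumptions made to keep the example small.

```python
def find_pri(toas, tol, min_hits=3):
    """Return the first TOA difference that repeats along a chain of pulses, or None.

    toas     -- times of arrival, in any consistent clock unit
    tol      -- tolerance applied to every comparison of differences
    min_hits -- number of equal consecutive differences needed to declare a PRI
    """
    toas = sorted(toas)
    n = len(toas)
    for i in range(n):
        for j in range(i + 1, n):
            delta = toas[j] - toas[i]        # step 1: candidate difference
            hits, last = 1, toas[j]
            # Step 2: look for later pulses spaced by the same difference.
            for k in range(j + 1, n):
                if abs((toas[k] - last) - delta) <= tol:
                    hits += 1
                    last = toas[k]
            if hits >= min_hits:             # step 3: declare the PRI
                return delta
    return None
```

The triple loop makes the growth in work with pulse count visible, which is the reason the benchmark that follows restricts processing to a moving time window.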

Benchmark: The benchmark described herein represents a "practical" approach to using the
TOAs as they come to the PRF sorter. As the processor receives the mth TOA, it stores the
data in memory, replacing an old TOA value. This approach represents a moving time window
over which processing will be performed. Without this constraint, the processor would be
saturated within a very few pulses. With small modification, this benchmark could be used as
a batch process in which a large number of TOAs are saved and processed as a group.

The processor must update the memory pointer and fetch an "old" TOA for delta calculation.
The delta calculation is performed N times, where N represents the average number of
pulses received during the time window. Two operations are required per pass.

After the delta is calculated, it is immediately compared with the deltas present in a given
$\Delta$ row, that is, if the new delta has subscripts i and j, this $\Delta$ will only be compared
with deltas having subscripts n and i. This comparison property is shown in Figure 9. If a match
is declared, the $\Delta_{ij}$ is stored and a "hit counter" is updated. The "hit counter" represents
the number of TOAs in a row that have had an equal TOA difference ($\Delta$). When the hit
counter exceeds a given value, the PRF is declared. This comparison requires N(N+1)
comparisons each containing one or more operations.

Below is a sample delta calculation and comparison program. Significant program development
is required to include how the hit counter is incremented or decremented, how a PRF is
declared, and how the data is used for prediction.

A new instruction has evolved from this benchmark: the windowed compare. Because the
use of an absolute compare function would create a noise-sensitive process, a tolerance must
be included to account for variations in the TOA measurements and the subsequent delta
calculation.
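
The idea behind the windowed compare can be shown in a few lines. This is a sketch of the concept only; the function names `windowed_compare` and `update_hit_counter` are hypothetical, and the report does not specify the instruction's actual operand format.

```python
def windowed_compare(a, b, tol):
    """Return True when a lies within +/- tol of b (a tolerant equality test)."""
    return -tol <= a - b <= tol

def update_hit_counter(new_delta, stored_deltas, tol, hits=0):
    """Count how many stored deltas match the new delta within the tolerance."""
    for stored in stored_deltas:
        if windowed_compare(new_delta, stored, tol):
            hits += 1
    return hits
```

An exact compare corresponds to `tol = 0`, which is the noise-sensitive case the paragraph above warns against.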


SAMPLE PROGRAM FOR DELTA CALCULATION AND COMPARISON

SETUP    ENTER N = N, M = M, I = M-N-1,          Comment: N is the average number of
         J = I-N-1, TOL = TOL                             pulses received during the time
                                                          window. M is the memory pointer.
         TOA(M) = TOA                            Comment: Store the new TOA

DELTA    DELTA(I,M) = TOA(M) - TOA(I)
         IF I = M, STOP
         I = I + 1

COMP     TEST1 = DELTA(I,M) - DELTA(J,I) + TOL   Comment: Comparison windows have been
         J = J + 1                                        set up; will be replaced by new
         IF J = M-1, JUMP DELTA                           instruction.
         TEST2 = TEST1 - 2*TOL
         IF TEST1 > TOL, JUMP COMP
         IF TEST2 < -TOL, JUMP DELTA
         IF TEST1 < TOL AND TEST2 > -TOL, THEN
             Increment the hit counter;
             Store Δ_im;
             Jump to COMP.
SECTION III


MULTIMODE  CPU  ARCHITECTURE 


3.0  INTRODUCTION 

In Section II, a baseline scenario was defined for future signal processing applications. The
scenario was presented as a set of representative benchmarks. The benchmarks were chosen
from previous Air Force procurements and in-house experience and are used to indicate the
various processing and control structures necessary to properly handle the problem set.

Table  4 is  a brief  compendium  of  the  benchmarks  and  the  data  processing,  data  addressing, 
and  control  structures  necessary  to  perform  the  benchmarks. 

In this section, an attempt will be made to utilize Table 4 and discuss the impact of the
processing needs on basic computing structures such as the control section, the ALU, the
data addressing, and the bus system.

3.1  PRIMITIVE  COMPUTING  STRUCTURE 

Conceptually, the most basic computing structure must contain a control function, an
arithmetic/logic function, and storage. All structures may be broken down to these fundamental
structures. For the purpose of discussion, Figure 10 represents a primitive computing
structure for handling signal processing. The control function is handled by an addressing unit
and a micro-program/instruction memory. That memory controls the functioning of the
arithmetics and storage, as well as its own addressing unit, thereby creating a self-contained
computer.


The arithmetic function is performed by the Register Arithmetic Logic Unit (RALU) and the
multiplier. The RALU performs all the basic arithmetics (add, subtract, shift) and the basic
logic functions (AND, OR, EXOR, COMPLEMENT). The multiplier performs a simple
hard-wired multiply function on any two operands presented to it. The multiplier is an extension
of the basic arithmetic function because the multiply function is generally required in signal
processing.

The storage function (operand storage) is handled by the data memory and the register
section of the RALU. The data memory has both permanent operand storage (i.e., ROM/PROM)
and temporary storage (i.e., RAM). The structure shown in Figure 10 assumes that
the addressing of operands (data addressing) is performed by the RALU or the controller.

Although Figure 10 shows a multitude of buses, a single bus can be conceived to handle all
control and data informational transfers. The bus structure will be discussed at length in the
next section.

This primitive computing structure has been presented as a basis for the following discussions.
These discussions will expand the description of the elements in Figure 10, as well as give
the rationale for the specific embodiments of the elements based on the baseline scenario and
architectural constraints.
Table 4. Benchmarks and Indicated Architectural Characteristics

Benchmark               Data Processing             Data Addressing         Control
----------------------  --------------------------  ----------------------  -----------------------
FFT                     Multiply Accumulate;        Array Indexing          Loop Counting
                        Complex Arithmetic

Coordinate Conversion   Double Precision;           Tight Data Loops        Data Dependent Branches
                        Numerical Scaling

CFAR                    Bit Manipulation            Simple Addressing       Data Dependent Jumps

Cosine Transform        High I/O Rates to Memory    High Addressing Rates   Loop Counting
                        and Outside World

Pulse Classification    Memory Table Lookup;        Array Indexing          Data Dependent Branches
                        Variable Bit Length Data
                        Words; High Speed
                        Arithmetic

3.2 BUS SPEED, WIDTH, EFFICIENCY

In viewing the signal processing problem from a system point of view, it becomes apparent that
for certain problems, such as the FFT and pulse characterization, bus traffic considerations
are paramount. For this reason, the design of the Multimode CPU began from the bus and
proceeded outward. This section and the section on multiplier structures justify this
decision and detail the structures dictated by the problem set.

3.2.1  Bus  Speed 

The FFT requires a great number of data memory reads and writes to accomplish the butterfly
operation. Because the speed of operations is also quite high, the path in time from the
generation of the read/write address for the data memory until the data reaches its destination
or arrives from its source must be minimized. In view of this requirement, a single bus for
data addresses and data would become extremely difficult to manage, considering the high
data flow required. It has been concluded that this path from address to data must be
pipelined to provide maximum speed; therefore, a separate data address bus and a separate
data bus are a necessity to handle the pipeline.

Furthermore, a minimum time path can be analyzed, as in Figure 11, which will give feasible
estimates of the time to take a previously generated address from the address register to the
memory, to fetch the operand from the data memory, and to send the operand to the data
register. The time path, therefore, is

$$T = T_{\text{REG OUT}} + T_{\text{DRIVER}} + T_{\text{ADR BUS}} + T_{\text{ACC}} + T_{\text{DATA BUS}} + T_{\text{LATCH}}$$

Figure 11. Minimum Time Path for Address to Register

With current technology, the minimum time path can be from 50 to 100 nsec, depending on
a number of factors, including IC drive, PC board techniques, memory selection, etc.
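
The time path is a simple sum of stage delays, which the sketch below makes explicit. The individual delay values are hypothetical placeholders chosen only so that the total falls inside the 50 to 100 nsec range quoted above; the report does not give per-stage figures, and the function name `address_to_register_time` is likewise an assumption.

```python
def address_to_register_time(t_reg_out, t_driver, t_adr_bus, t_acc, t_data_bus, t_latch):
    """Sum the stage delays from address register output to data register latch."""
    return t_reg_out + t_driver + t_adr_bus + t_acc + t_data_bus + t_latch

# Hypothetical stage delays in nanoseconds (illustrative values, not from the report).
example_ns = dict(t_reg_out=5, t_driver=5, t_adr_bus=5, t_acc=45, t_data_bus=5, t_latch=5)

total_ns = address_to_register_time(**example_ns)

# Highest access rate if each access must complete before the next one starts.
max_rate_mhz = 1000.0 / total_ns
```

With these placeholder values the memory access time dominates the path, which is why pipelining the address and data phases on separate buses raises the sustainable rate.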

Using  the  above  argument  for  separate  data  and  data  address  buses,  it  has  been  concluded 
that  a five  bus  system  is  necessary  to  maintain  and  support  the  data  bus  and  data  address 
bus  requirements. 

The  five  buses  are: 

DATA
DATA ADDRESS
INSTRUCTION
INSTRUCTION ADDRESS
STATUS FLAGS OF PROCESSORS

Analysis shows that each of these buses must maintain a speed equivalent to the speeds of
the data and data address buses, that is, the instruction address to microinstruction memory to
instruction register path must be as quick as the data address/data path.

3.2.2  Bus  Width 

As  stated  before,  the  FFT  presents  the  most  challenging  problem.  This  extends  into  the  area 
of  bus  width.  The  FFT  butterfly  requires  two  or  three  complex  data  reads  and  two  complex 
data  writes  be  performed.  Obviously,  the  bus  could  be  structured  so  that  the  complex  words 
are  accessed  as  real  quantities  (2  per  complex  word).  Such  a bus  would  double  the  number 
of  reads  and  writes  necessary  to  accomplish  an  FFT  butterfly,  thereby  doubling  the  time  to 
set  up  the  FFT  independent  of  the  multiplier. 

The indicated conclusion is that a dual data bus system should be used so that a single read
time is necessary to access and transmit a complex data word. Furthermore, the data words
should be 16 bit real and 16 bit imaginary to allow processing gain without scaling, which
would require either additional processing steps or more hardware. Therefore, the data bus
will be 32 bits wide to handle the complex data for the FFT. This size is also good if any
double precision arithmetic is necessary. Coordinate conversion routines sometimes require
expanded accuracy for positional fixes.


3.2.3 Bus Efficiency

Bus efficiency is a motherhood topic. The maximum efficiency is 100 percent. Any signal
processor should strive for the maximum, especially during the FFT processes. By the FFT's
very nature, a 50 percent efficiency is a practical upper bound, that is, 2 or 3 data reads,
some processing time, and 2 data writes. The processing time is generally equivalent to the
sum of read and write times if the processor is properly organized. A cursory conclusion at
this point is if a practical upper bound of 50 percent efficiency is obtainable, why not have
two FFT butterfly operations running concurrently and out-of-phase so that one is processing
during the reads and writes of the other and vice versa? Thus, the bus efficiency can
approach the maximum.
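
The overlapping argument can be checked with a very small model. This is a sketch under idealized assumptions: each butterfly needs a fixed number of bus cycles and an equal number of processing cycles, and the units can be scheduled perfectly out of phase. The function name `bus_efficiency` and its parameters are hypothetical.

```python
def bus_efficiency(bus_cycles, processing_cycles, overlapped_units=1):
    """Fraction of time the bus is busy when several butterflies run out of phase.

    bus_cycles        -- reads plus writes needed by one butterfly
    processing_cycles -- cycles one butterfly spends computing with the bus idle
    overlapped_units  -- number of butterflies interleaved on the same bus
    """
    period = bus_cycles + processing_cycles
    # The bus cannot be busy for more than the whole period.
    busy = min(overlapped_units * bus_cycles, period)
    return busy / period

single = bus_efficiency(5, 5)                        # 3 reads + 2 writes, equal processing
paired = bus_efficiency(5, 5, overlapped_units=2)    # two butterflies out of phase
```

In this model a third interleaved unit gains nothing, since the bus is already saturated by two.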

3.3  MULTIPLIER  STRUCTURES 

This section on multipliers has been included to discuss the impact of a multiplier special
function unit on the speed and bus traffic of a signal processor. The multipliers described
herein will be assumed to have 16 x 16 bit multiply capability and may be any of a number
of available multiplier organizations, such as parallel, pipelined, or serial-parallel.

The problem set will be those discussed heretofore; however, the FFT remains the most
challenging problem. The actual design of the multiplier will not be included although its
implementation greatly impacts the LSI-ability of the multiplier special function unit.

3.3.1  FFT  Butterfly 

To accomplish an FFT butterfly, the signal processor and its special function unit must fetch
two complex data points (and possibly a complex rotation vector), perform a complex multiply
and two complex adds, and store two complex data points. Figure 12 shows the actual
operation of the butterfly.

However, to perform the complex operations described above, the current processors must
perform all real operations. The complex multiply becomes four real multiplies and two real
adds, and the complex adds become two real adds each. Thus, the optimum structure to
perform the FFT butterfly would have four parallel real multipliers and two real adders
performing the complex multiply, and four real adders performing the complex adds (see
Figure 13).
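
The decomposition into real operations can be written out directly. The sketch below assumes the common radix-2 form of the butterfly, A' = A + W*B and B' = A - W*B; Figure 12 is not reproduced here, so that form, together with the function names, is an assumption. Complex values are passed as (real, imaginary) tuples to make every real operation visible.

```python
def complex_multiply_real_ops(br, bi, wr, wi):
    """Complex product (br + j*bi) * (wr + j*wi) using four real multiplies and two real adds."""
    p1 = br * wr
    p2 = bi * wi
    p3 = br * wi
    p4 = bi * wr
    return p1 - p2, p3 + p4      # (real part, imaginary part)

def butterfly(a, b, w):
    """Return (A', B') = (a + w*b, a - w*b) for (real, imag) tuples a, b, w."""
    ar, ai = a
    tr, ti = complex_multiply_real_ops(b[0], b[1], w[0], w[1])
    # Two complex adds, each made of two real adds (four real adds in total).
    return (ar + tr, ai + ti), (ar - tr, ai - ti)

# Example with values that are exact in binary floating point.
upper, lower = butterfly((1.0, 2.0), (3.0, -1.0), (0.5, 0.5))
```

Counting the operations in the two functions gives four real multiplies and six real adds per butterfly, matching the structure described above.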

All signal processors must emulate the FFT butterfly structure in Figure 13, either by
furnishing all the hardware, by recursive use of a single multiplier, or in software. Assuming
that the purely software method would be both clumsy and slow, only the first two methods
will be discussed. Four multiplier structures will be discussed as means of accomplishing
the FFT butterfly.

3.3.2  Multiplier  FFT  Structures 

The simplest structure is a single multiplier with two input latches to latch in the 16-bit
operands, a 16 x 16 multiplier, and two 16-bit output latches to hold a double precision
product. This multiplier could be constructed from the AMD 2 x 4 multiplier chips or
the TRW 16 x 16 multiplier chip.

An advancement from the simple structure is the addition of an accumulator that can handle
addition of two 32-bit products, thereby performing the real or imaginary portion of the
complex multiply. This structure is called a multiply/accumulator. Currently, a version is
available from TRW that can handle 12 x 12 bit multiplication. Although the TRW product
is insufficient, it is a step in the right direction.

A further  advancement  would  include  the  holding  registers  necessary  to  perform  the  whole 
complex  multiply  without  multiple  operand  fetches  and  stores.  Figure  14  shows  such  a 
structure.  The  rotation  vector  and  data  point  can  be  loaded  directly  into  input  latches;  the 
four  multiplies  can  be  pipelined  through  the  single  multiplier;  the  partial  products  can  be 
accumulated  and  held  in  latches;  and  the  complex  product  can  be  outputted  in  a single  clock 
time.  No  present  product  is  known  that  can  accomplish  this  structure  on  a single  chip; 
however,  the  Raytheon  Micro-Signal  Processor’s  pipeline  structure  virtually  performs  this 
operation. 

The final advancement would be a structure that totally emulates the FFT butterfly structure
in Figure 13. The only difference would be that various registers would be necessary to hold
A, B, and W, as well as intermediate results. This structure is a totally hardware approach;
therefore, the unit would be a special purpose processor, even though standard multiplies
could be performed without any penalty.

3.3.3  System  Impact  of  Multiplier/FFT  Structures 

Each of the structures discussed above will be analyzed herein with regard to their impact on
bus efficiency and speed. As discussed in Section 3.2.3, a goal of a processor should be 100
percent bus efficiency; however, this efficiency concept must be extended to include a
statement about the types of bus traffic. Obviously, the bus can be filled with partial products
and incomplete solutions (that is, shuffling intermediate data around to accomplish a task),
or the bus can be filled with operands and solutions. The latter case indicates a higher
"true" efficiency of the bus, and is a function of whether the complex data is transferred
simultaneously or sequentially in the case of the FFT.

Heuristically, if the data is transferred as complex words, a 1024-point FFT will require 1024
input transfers and 1024 output transfers; that is, 2N transfer times for N points. However,
if the data is transferred as real words (the complex word is treated as two real quantities),
the same FFT will require 2048 input transfers and 2048 output transfers; that is, 4N transfer
times for N points. The following discussion will take this heuristic argument and analyze the
specifics of each multiplier/FFT structure. For this discussion, Figure 15 will be considered
the system architecture.
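
The 2N versus 4N argument is easy to verify numerically. The helper below is only a sketch of that count; the name `fft_transfer_times` and the `complex_bus` flag are assumptions, and the function counts only the input and output transfers named in the paragraph above, not intermediate butterfly traffic.

```python
def fft_transfer_times(n_points, complex_bus=True):
    """Input plus output transfer times for an N-point FFT.

    complex_bus=True  -- real and imaginary parts move together (one transfer per point)
    complex_bus=False -- each complex word is moved as two real quantities
    """
    words_per_point = 1 if complex_bus else 2
    inputs = n_points * words_per_point
    outputs = n_points * words_per_point
    return inputs + outputs

wide = fft_transfer_times(1024, complex_bus=True)      # 2N
narrow = fft_transfer_times(1024, complex_bus=False)   # 4N
```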

3.3.3.1 Case 1: Multiplier

To perform the butterfly, a single multiplier will be used for the real multiplies, and the
RALU's will be used as the adder/accumulators. Since the multiplication requires two
operands be presented every multiply cycle by the RALU's or the data memories, the real
and imaginary buses (16 bits each) are tied up for loading and each output ties up the real
bus.

In addition to the bus traffic to load and unload the multiplier, two or three complex reads
are necessary to set up the butterfly by putting the operands into holding registers, that is,
the registers on the RALU. Finally, two complex writes are necessary to store the output of
the butterfly.

Table 5 has been included to estimate the bus activity in clock cycles. The multiply time is
assumed to be one or two clock cycles. From the cycle totals, the bus is busy about 70
percent of the time; however, two-thirds of the bus activity is shuffling data to and from the
multiplier. Furthermore, overlapping of butterflies would be virtually impossible; therefore, the
bus and multiplier must remain unused during part of the cycle. It is concluded that such a
system would be inefficient in performing the FFT butterfly.


Table 5. Bus Activity as a Function of the Multiplier/FFT Structure
(Bus activity in clock cycles)

                            Case 1           Case 2           Case 3           Case 4
Operation Being             Bus     Bus      Bus     Bus      Bus     Bus      Bus     Bus
Performed                   Active  Free     Active  Free     Active  Free     Active  Free
-----------------------------------------------------------------------------------------------
Complex Operand Reads       2/3     -        2/3     -        *       -        *       -
MPY Input                   4       -        4       -        1/2     -        2/3     -
MPY Operation               -       4/8      -       4/8      1†      4/8      -       1/2
MPY Output                  4       -        2       -        1       -        -       -
Intermediate Adds for       -       2        -       -        -       -        -       -
  Complex MPY
Complex Adds and Word       2       -        2       -        2       -        2       2
  Writes
-----------------------------------------------------------------------------------------------
Total Per Column            12/13   6/10     10/11   4/8      5/6     4/8      4/5     3/4
Total Cycles per Case       18/19 to 22/23   14/15 to 18/19   8/9 to 12/13     7/8 to 8/9

* Complex words go directly to Multiplier
† Additional complex read during Multiply operation (does not increase total cycles)
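
The column totals of Table 5 can be cross-checked against the "Total Cycles per Case" row with a short script. The active and free totals below are copied from the table; the dictionary layout, the `overlapped` field (used for the Case 3 read that takes place during the multiply), and the function names are assumptions introduced for this sketch.

```python
# (low, high) cycle totals per column, as listed in the "Total Per Column" row.
TABLE5_TOTALS = {
    "Case 1": {"active": (12, 13), "free": (6, 10), "overlapped": 0},
    "Case 2": {"active": (10, 11), "free": (4, 8), "overlapped": 0},
    # Case 3: one bus-active read happens during the multiply, so it adds no cycles.
    "Case 3": {"active": (5, 6), "free": (4, 8), "overlapped": 1},
    "Case 4": {"active": (4, 5), "free": (3, 4), "overlapped": 0},
}

def total_cycles(entry):
    """All total cycle counts implied by the active and free alternatives."""
    return sorted(a + f - entry["overlapped"]
                  for a in entry["active"] for f in entry["free"])

def utilization_range(entry):
    """Lowest and highest fraction of cycles in which the bus is active."""
    fractions = [a / (a + f - entry["overlapped"])
                 for a in entry["active"] for f in entry["free"]]
    return min(fractions), max(fractions)

summary = {case: (total_cycles(e), utilization_range(e))
           for case, e in TABLE5_TOTALS.items()}
```

The totals reproduce the ranges in the last row of the table, and the utilization figures fall near the percentages quoted in the case discussions.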


3.3.3.2 Case 2: Multiply/Accumulator

As  in  Case  1,  a single  multiplier  is  employed,  and  the  RALU  is  used  as  the  operand  holding  registers; 
however,  the  intermediate  adds  necessary  to  complete  one-half  of  the  complex  multiply  are  done  in 
the  accumulator. 

Once again, the bus traffic is split between operand fetching and multiplier loading and unloading.
As indicated in Table 5, the bus is busy about 60 percent to 70 percent of the time and 60 percent
of the bus activity is the movement of operands to and from the multiplier. Overlapping of
butterflies would again be quite difficult, and the bus and multiplier have idle time. Although the
accumulator with the multiplier is an improvement over a simple multiplier, this system is still
inefficient in performing the butterfly.

3.3.3.3 Case 3: Multiplier/Accumulator with Holding Registers

A single  multiplier  is  used;  however,  the  operand  holding  registers  and  accumulators  are 
included  so  that  the  complete  complex  multiply  can  be  done  without  intermediate  data  being 
placed  on  the  data  bus.  The  complex  multiplier  and  multiplicand  go  directly  to  the  multiplier 
holding  registers,  and  the  third  complex  word  goes  to  the  RALU’s  during  the  multiplier 
operation,  thereby  not  increasing  the  total  time  to  perform  the  butterfly. 

Except for the movement of the complex product from the multiplier to the RALU so that
the two complex adds can be done to finish the butterfly, all of the bus traffic is the fetching
and storing of data in the data memories. The bus is active approximately 50 percent of the
time; therefore, if two multiplier units of this type were employed, the overlapping of
butterflies could be accomplished, resulting in approximately 100 percent bus efficiency. Using
the overlapping process, the multipliers could be kept busy full-time.

This approach to the multiplier special function unit is a significant improvement over both
cases 1 and 2. This system would be quite efficient in performing the butterfly operation.

3.3.3.4 Case 4: Multiplier/FFT

Multiple multipliers are used, all holding registers for the three complex operands are in the
unit, and all accumulators are included. Essentially, the rotation vector and the two complex
operands are directly loaded into the multiplier unit, the complex multiply is performed, the
two complex adds are performed, and the outputs are loaded back into the data memory.

All the data bus activity is dedicated to loading and storing operands. The bus is active 50
percent of the time, and as in case 3, 100 percent efficiency could be accomplished by
overlapping in time if two multiplier units were employed.

Obviously, this approach represents the most efficient approach to performing the butterfly;
however, this efficiency can only be accomplished with dedicated hardware. The system
design must ultimately weigh the minimal differences in performance between the units
in case 3 and case 4.

3.4 COMPLEX PROCESSOR


A detailed review of the multiplier special function unit cases has revealed that the more
sophisticated units - the multiplier/accumulator with holding registers and the multiplier/FFT
- can permit extensions/modifications to the RALU structures that will impact the data
addressing function. Although the modifications are minor, two different complex
processors have been identified - one if the simpler multiplier units must be used, the other
if the more complex units are available.

This  section  has  been  included  to  describe  the  processor  architectures  from  a fairly  high  level. 
Within  this  section,  the  first  complex  processor  to  be  discussed  will  have  a multiplier  or 
multiplier/accumulator,  and  the  second  will  have  the  more  sophisticated  multiplier  functions. 

3.4.1  Processor  1 (See  Figure  16) 

3.4.1.1 Data Processors

This processor uses two real processors, or, more appropriately, RALU's to perform the
complex arithmetic dictated by the problem set. Each real data processor is a 16-bit RALU, able
to perform arithmetics, logicals, etc. in a single instruction cycle. Therefore, the two real
processors can perform the full complex add or subtract function in a single instruction if they
are worked in tandem.

3.4.1.2 Data Addresser

The data addresser is a single 16-bit processor RALU which must be able to add, subtract,
increment and compare. In the configuration shown, the addresser can furnish two 16-bit
addresses per clock cycle to the data memories; however, only one new address can be
calculated during that period. This calculation limitation is not a hindrance for the problem set
herein discussed. A third port on the data addresser is tied to one of the data buses so that
a data word may be used as a data address such as in the case of a ROM table look-up.

3.4.1.3 Data Memories

The data memories will include both temporary and permanent storage, i.e. RAM and ROM.

To support complex processing, one memory will be for the real operands; the other, for the
imaginary operands.

3.4.1.4 Multiplier

The  input  latches  are  connected  as  shown  in  Figure  16.  Because  the  complex  multiply  requires 
a multiplication  of  two  real  operands  and  two  imaginary  operands,  the  crisscrossing  of  the 
“real”  bus  to  the  imaginary  processor  and  vice  versa  is  necessary.  The  crisscrossing  is  also 
desirable  if  the  processor  is  to  be  used  in  an  array  fashion. 

The output latches hold the product of the input words until desired. The most significant
bits are latched in the C latch and attached to the real bus. This latch is the only one used
in most cases. When double precision products are necessary, the D latch holds the least
significant bits and is attached to the imaginary bus; thus, the imaginary bus becomes the
lower bits bus when double precision arithmetic is being performed.

All the latches, both input and output, are assumed to be independently latched.

3.4.1.5 Control

The control structure, as shown in Figure 17, contains a single Instruction Addressing Unit
(which will be described in a subsequent section) that addresses the microinstruction memory.
The microinstruction memory has a total of 109 control bits and a maximum of 4096 words.
The control bits are as follows:

25  Bits  to  control  Real  Processor 
25  Bits  to  control  Imaginary  Processor 
25  Bits  to  control  Data  Addresser 
4 Bits  to  control  Multiply  Function 
4 Bits  to  control  Real  Data  Memory 
4 Bits  to  control  Imaginary  Data  Memory 
12  Bits  for  Jump/Branch  Addresses 
10  Bits  for  Next  Microinstruction  Control 
109  Total 

This control word is exceptionally wide; however, the system designer must make a compromise
at this point. The total number of bits to control the processors, etc. cannot be lowered,
but the number of microinstruction bits can be significantly reduced. Reducing the number
of microinstruction bits simply means that a high degree of decoding must be accomplished
either within the processor or in an external ROM/PROM. The decoding operation takes time.
The  decision  must  be  based  on  the  time  available.  If  speed  is  the  goal,  then  the  amount  of 
decoding  must  be  minimal.  Thus,  the  control  section  here  has  opted  for  speed. 

Because  the  microinstruction  word  is  extremely  wide,  it  is  assumed  that  the  microinstruction 
register  is  part  of  each  function  being  controlled,  i.e.,  the  control  registers  are  within  the  data 
processor,  etc. 

The Instruction Addressing (IA) unit contains the flag logic that is necessary. There are three
sources of status or flag returns in the complex processor: the real and imaginary processors
and the data addresser. Each of these processors can return four bits; this may be a problem
for the flag logic provided on the IA unit. Expansion may be necessary in some cases; however,
this is unlikely for the given problem set.

The  control  unit  must  be  able  to  furnish  microinstructions  to  the  data  processors  and  data 
addresser  at  the  minimum  instruction  completion  rate.  Since  these  functions  have  been  defined 
earlier  in  this  section  as  having  single  clock  instructions,  the  control  unit  must  be  able  to 
supply  instructions  every  clock  cycle.  If  that  instruction  rate  can  be  maintained,  then  no 
instruction  buffer  or  FIFO  register  is  necessary  or  even  desirable.  The  buffer  or  FIFO  causes 
problems  in  algorithms  with  a high  degree  of  jumps  such  as  the  pulse  classification  task. 

Before  a jump  or  branch  can  be  accomplished,  the  FIFO  must  be  cleared,  or  a fast-address- 
around  loop  must  be  included. 
3.4.2 Processor II (See Figure 18)

3.4.2.1 Data Processor/Addresser (DP/DA)


By  providing  a more  sophisticated  multiplier  special  function  unit,  the  need  for  a separate  data 
addressing  unit  is  obviated  because  the  ALU  of  the  data  processor  is  virtually  unused  during 
the FFT butterfly operation. The remaining problems in the benchmarks are much less difficult
or strenuous from the ALU’s point of view. In fact, the remaining problems require only
one RALU or data processor and one data addresser.

The  data  processor/data  addresser  is  explained  in  depth  in  a subsequent  section.  The  struc- 
ture is  essentially  the  same  as  the  data  processor  from  Processor  I with  circuitry  added  to 
perform  data  address  incrementing  and  with  an  additional  data  address  register  and  port. 


The dual function DP/DA is able to perform processing functions such as complex add or
subtract and increment an address simultaneously, or to calculate and furnish two 16-bit addresses
every clock cycle. Furthermore, because the DP/DA functions share the same register stack,
there exists an intrinsic ability to transfer data to the address port for a ROM table look-up.


3.4.2.2 Data Memory

Same as Processor I.


3.4.2.3  Multiplier/FFT 


The  multiplier  for  this  processor  is  capable  of  performing  fully  complex  arithmetic  as  well  as 
real  and  double  precision  arithmetic. 

3.4.2.4 Control

The principles of operation are exactly the same as for Processor I. Figure 17 represents the
control structure. The only variance is the specified use of the control bits and the total number
of bits necessary. The control bits are used as follows:

26  Bits  to  control  Real  DP/DA 
26  Bits  to  control  Imaginary  DP/DA 
10  Bits  to  control  MPY/FFT 
4 Bits  to  control  Real  Data  Memory 
4 Bits  to  control  Imaginary  Data  Memory 
12  Bits  for  Jump/Branch  Address 
10  Bits  for  Next  Microinstruction  Control
92  Total 
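
The two control-word layouts listed for Processors I and II can be checked with a few lines of arithmetic. The sketch below uses the field widths exactly as tabulated in the text; the dictionary names and the helper functions are illustrative only. Applying the 4096-word maximum to Processor II is an assumption, since the report states that depth only for Processor I.

```python
# Field widths (in bits) copied from the two control-word tables in the text.
PROCESSOR_I_FIELDS = {
    "real processor": 25,
    "imaginary processor": 25,
    "data addresser": 25,
    "multiply function": 4,
    "real data memory": 4,
    "imaginary data memory": 4,
    "jump/branch address": 12,
    "next microinstruction control": 10,
}

PROCESSOR_II_FIELDS = {
    "real DP/DA": 26,
    "imaginary DP/DA": 26,
    "MPY/FFT": 10,
    "real data memory": 4,
    "imaginary data memory": 4,
    "jump/branch address": 12,
    "next microinstruction control": 10,
}

MAX_WORDS = 4096  # stated for Processor I; assumed here for Processor II as well


def word_width(fields):
    """Total microinstruction width in bits."""
    return sum(fields.values())


def control_store_bits(fields, words=MAX_WORDS):
    """Size of a fully populated microinstruction memory, in bits."""
    return word_width(fields) * words
```

The 12-bit jump/branch field is consistent with the 4096-word maximum, since 2^12 = 4096.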

3.4.3  Complex  Processor  Performance 

The  two  complex  processors,  discussed  herein,  were  analyzed  in  depth  to  determine  their  per- 
formance. The  FFT  butterfly  and  the  pulse  classification  benchmarks  were  chosen  for  the 
analysis  because  they  represent  the  most  strenuous  problems  in  the  baseline  problem  scenario. 
The  FFT  is  extremely  orderly  in  its  instruction  flow  where  the  arithmetic  operations  are  a 
preponderance  of  the  problem.  The  pulse  classification  benchmark  represents  a repetition  of 
arithmetics,  but,  more  importantly,  it  contains  a high  degree  of  conditional  and  unconditional 
jumps,  which  is  a good  test  of  the  flexibility  of  the  control  structure. 

Appendix  A contains  the  equations  and/or  task  flow  of  the  algorithms  and  the  coding  and 
timing of the two processors. The summary is given below.

Processor I requires 17 clock cycles to perform the butterfly (19 if the rotation vector must
be loaded); therefore, 89086 clock cycles are necessary to do a 1024 point FFT. Processor
II with a single multiply/FFT unit requires either eight or nine effective clock cycles per
butterfly, thereby needing less than half the number of clock cycles: 41987. Only 20 percent
more cycles are necessary if the dual multiplier/accumulators with holding registers are
employed.
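
The cycle totals quoted above can be put in context with a short calculation. The sketch below assumes a radix-2 FFT, for which an N-point transform needs (N/2) log2 N butterflies; the report does not state the radix here, so this is an assumption. The difference between the pure butterfly cycles and the reported totals is presumably setup, loop control, and rotation-vector loads detailed in Appendix A, but that attribution is also an assumption.

```python
REPORTED_TOTAL_I = 89086    # clock cycles quoted for Processor I
REPORTED_TOTAL_II = 41987   # clock cycles quoted for Processor II


def butterflies(n_points):
    """Number of radix-2 butterflies in an n_points FFT (n_points a power of two)."""
    stages = n_points.bit_length() - 1
    if 1 << stages != n_points:
        raise ValueError("n_points must be a power of two")
    return (n_points // 2) * stages


def butterfly_cycles(n_points, cycles_per_butterfly):
    """Clock cycles spent in butterflies alone, ignoring all overhead."""
    return butterflies(n_points) * cycles_per_butterfly


# Overhead implied by the reported totals under the radix-2 assumption.
overhead_i = REPORTED_TOTAL_I - butterfly_cycles(1024, 17)
overhead_ii = REPORTED_TOTAL_II - butterfly_cycles(1024, 8)
```

Under these assumptions a 1024-point transform has 5120 butterflies, so the 17-cycle and 8-cycle butterflies account for most, but not all, of each reported total.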

Both processors performed equally well on the pulse classification benchmark. This benchmark
requires seven cycles for setup, three for close-out (i.e., thresholding), and 6051 for classification.
The total is 6061 clock cycles. Processor I potentially has an advantage in performing
pulse classification because it has two data processors and an independent data
addresser; however, the dual DPs are not an advantage unless dual control sections can be provided
for testing and program control. Such a structure would simply become two parallel
processors.  Both  processors  can  perform  the  benchmarks  as  demonstrated.  The  speed  advantage 
of  Processor  II  is  purely  a result  of  additional  hardware,  which  is  probably  justified  in  the 
case  of  the  FFT  butterfly. 

3.5 DATA PROCESSOR/DATA ADDRESSER

This section gives a full description of the Data Processor (DP) and Data Addresser (DA)
architectures, as well as introductory words on the design rationale for the general structure.
These structures will not be discussed as integrated circuits, although some
reference may be included if a design rationale is only clear in the IC context. Specific IC
trade-offs are covered in the technology section of this report.


3.5.1  Design  Rationale 

Early in the design effort, it was noted that similar structures for the DP and DA functions
could be employed if the multiplier/FFT structure was considered independent of the Data
Processor. Each function, DP or DA, has a need for a number of high-speed, on-chip
registers and an Arithmetic/Logic Unit (ALU) structure. Because the general structures
were  similar,  a more  detailed  look  was  warranted.  Below  is  a capsule  of  the  register  and 
ALU  needs  of  each  function. 

3.5.1.1 Registers

The Data Addresser, described as part of the complex processor section, is a highly utilized
function requiring the same high speed that the DP requires. The problem set forces the
signal  processor  to  address  data  operands  at  a very  high  rate;  therefore,  it  is  incumbent  on 
the  processor  to  calculate  its  data  addresses  quickly,  forcing  the  need  for  on-chip  registers. 

The registers must store the current address of the operand being fetched, the starting
address of the operand string to be utilized, the maximum or ending address of the operand
string, and the incremental values or delta addresses. An incremental value is used to
determine the steps through the operand string, and there may be need for more than one
incremental value if the addressing is complex. To further complicate the problem, if
double indexing or higher indexing is advantageous, register space is necessary for all the
start, maximum, current and delta addresses. To satisfy the double index need, a minimum
of 8 registers is dictated, and 16 registers would be preferable.
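
The register budget described above can be sketched as follows. Each index level holds a start, maximum, current, and delta address, so double indexing needs 2 x 4 = 8 registers. The class name, the method names, and the wrap-to-start behavior on reaching the maximum are assumptions made for illustration; the report only says that the maximum test raises a flag used for a jump/branch or a loop-counter decrement.

```python
from dataclasses import dataclass, field

REGISTERS_PER_INDEX = 4  # start, maximum, current, delta


@dataclass
class IndexRegisters:
    """One index level of a data addresser (hypothetical model)."""
    start: int
    maximum: int
    delta: int
    current: int = field(init=False)

    def __post_init__(self):
        self.current = self.start

    def step(self):
        """Return (address, at_end) and advance the current address by delta.

        at_end is the test flag: True when the following address would pass
        the maximum. In that case this sketch reloads the start address.
        """
        address = self.current
        following = self.current + self.delta
        at_end = following > self.maximum
        self.current = self.start if at_end else following
        return address, at_end


def registers_needed(index_levels):
    """Registers required to hold all four values for each index level."""
    return REGISTERS_PER_INDEX * index_levels
```

For example, an index with start 0x100, maximum 0x106, and delta 2 yields 0x100, 0x102, 0x104, 0x106 and then returns to 0x100, raising the flag on the fourth step.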


The register needs of the DP are very straightforward: operand storage, intermediate results
storage, and, depending on the multiplier special function unit, multiplier operand storage.

In  every  algorithm  coded  to  date,  the  maximum  number  of  registers  utilized  has  never 
exceeded  six,  even  with  the  most  inefficient  multiplier  structure. 

A final  comment  on  the  registers  is  necessary.  The  registers  should  be  Multiport  RAM  with 
two  read-ports  and  one  or  two  write-ports  depending  on  the  multiplier  for  ease  of  operand 
fetch  and  storage  in  the  registers. 


3.5.1.2 Arithmetic/Logic Unit

The DA function requires only the most basic arithmetic to be able to complete its tasks:

addition

subtraction

increment/decrement (±1)

The ability to test addresses for the maximum must be available. The test requires simple
subtraction and the generation of a test flag to force a jump/branch, or a decrement in the
loop counter.

The DP must have a sophisticated ALU with full arithmetics, logicals, and shifts. The generation
of test flags for data dependent operations and signals for carry generate and propagate
for extended precision arithmetic must be available.

3.5.1.3 RALU Structure

The  conclusion  drawn  from  the  above  discussion  is  that  a RALU  structure  is  indicated.  The 
DA  function  forces  the  highest  need  for  registers,  and  the  DP  requires  the  more  sophisticated 
ALU;  however,  neither  requirement  forces  an  untenable  deviation  from  the  needs  of  the  other 
function.  If  anything,  the  RALU  is  a slight  overkill  for  each  function. 

3.5.1.4 Additional Comments

As  discussed  in  Section  3.4,  the  RALU  structure  for  the  DP/DA  will  be  controlled  by  a wide 
instruction  word  with  little  or  no  decoding  on  the  chip.  This  constraint  has  been  applied 
because  it  offers  the  highest  speed  and  maximum  flexibility  in  the  timing  cycles,  thereby, 
allowing  fast  single  cycles  and  multiple  cycles  if  necessary. 

To  minimize  the  total  chip  count  of  the  signal  processor,  the  instruction  words  are  latched 
onto  the  chip  and  held  in  instruction  registers.  In  other  words,  no  external  registers  are 
needed for instructions. The rationale is simply that external registers are inefficient
because  their  low  gate-to-pin  ratio  requires  many  additional  chips.  By  placing  them  on  the 
RALU,  the  number  of  I/O  pins  on  the  RALU  is  unaffected,  and  the  gate  count  is  only 
slightly  increased. 

Finally,  all  the  ports  are  latched  and  tristated  to  minimize  external  multiplexers.  Since  the 
data bus/data address bus are system limiters, it was concluded that the fewer the number of
multiplexers, the faster the bus could operate.

3.5.2  Two  DP/DA  Structures 

Depending  on  the  multiplier  special  function  unit,  variations  in  the  specific  DP/DA  archi- 
tecture are  indicated.  The  simpler  multiply  functions,  discussed  in  the  complex  processor, 
required  a whole  unit  be  dedicated  to  data  addressing.  Each  DP  and  DA  unit  has  the  RALU 
structure  described  above,  that  is,  a full  function  ALU,  a three  Port  MPR  and  3 Bidirec- 
tional I/O  ports  (see  Figure  14). 

The more sophisticated multiply functions, also discussed earlier, minimize the use of the
ALU in the DP; thus, the functions of DP and DA can be combined because the DP/DA is
used  for  addressing  and  calculating  addresses  during  the  FFT  butterfly.  By  combining  the 
functions,  additional  features  are  necessary  on  the  DP/DA  to  support  the  addressing  when 
the  ALU  & I/O  ports  are  being  used  during  processing.  An  address  incrementer  with 
increment/decrement  and  pass  capability  and  an  additional  unidirectional  port  for  addressing 
must  be  added.  Furthermore,  the  MPR  requires  an  additional  write  port  so  that  addresses 
may  be  incremented  and  written  back  into  the  MPR  simultaneously  with  data  being 
processed  in  the  ALU  and  being  written  back  in  the  MPR  (see  Figure  20).  Only  in  this 
case  is  a full  four-Port  MPR  required. 

Figure 20. Processor II RALU


3.6 INSTRUCTION ADDRESSING


The control function for the complex processor consists of a microinstruction memory and
an instruction addresser. The instruction addresser (IA) includes a microsequencer, a loop
counter, an interrupt control unit, and flag logic. The IA will furnish a 12 bit address to
the microinstruction memory, which will control the DP’s and DA, set up the IA for
the next microinstruction, and provide the jump/branch addresses. The next sections will be
devoted to explaining the IA architecture.

3.6.1  Program  Control  Unit 

The  program  control  unit  (PCU)  is  indigenous  to  all  stored  program  computers  and  is  often 
called  the  microsequencer  (a  la  2909).  The  PCU  is  shown  in  Figure  21.  The  heart  of  the 
PCU  is  the  address  multiplexer  and  register.  Under  the  control  of  the  IA  instruction  register, 
the flag logic, and the interrupt logic, the address MUX acts as a “traffic cop”, selecting the
next  microinstruction  address  from  4 sources: 

1.  Program  Counter 

2.  LIFO  Stack 

3.  External  Input 

4.  Interrupt  Address. 

The program counter generally contains the “next address” in its register. During normal
operation, the program counter simply is incremented by 1 and steps through the program. The
output of the address multiplexer is incremented (actually +1, +2, or pass) and stored in the program
counter. When a branch operation is being initiated, the program counter contents
are fed to the LIFO Stack as the branch return address.

The LIFO (last in-first out) stack is a group of registers that are 12 bits wide which generally
store the branch return address. The LIFO is read when a branch return is necessary.

The external input is used as a way of “forcing” a jump or branch instruction address and is chosen
only when the address multiplexer receives the correct condition select codes and/or test conditions
from the flag logic and instruction register.


The interrupt address is a hard wired address which is furnished by the interrupt control unit
(ICU). The ICU will be discussed later.
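
The address selection described above can be sketched as a small model of the program control unit. The four sources and the 12-bit address width come from the text; the class and argument names are hypothetical, only the +1 increment is modeled (the report also mentions +2 and pass), and the stack depth is left unbounded because the report does not give one.

```python
class ProgramControlUnit:
    """Toy model of the address multiplexer, program counter, and LIFO stack."""

    ADDRESS_MASK = 0xFFF  # 12-bit microinstruction address

    def __init__(self):
        self.program_counter = 0  # holds the "next address"
        self.stack = []           # LIFO stack of branch return addresses

    def next_address(self, source, external=0, interrupt=0, push_return=False):
        """Select the next microinstruction address from one of four sources."""
        if push_return:
            # A branch is being initiated: save the return address first.
            self.stack.append(self.program_counter)
        if source == "pc":
            address = self.program_counter
        elif source == "stack":
            address = self.stack.pop()
        elif source == "external":
            address = external
        elif source == "interrupt":
            address = interrupt
        else:
            raise ValueError(f"unknown address source: {source}")
        address &= self.ADDRESS_MASK
        # The multiplexer output is incremented and stored in the program counter.
        self.program_counter = (address + 1) & self.ADDRESS_MASK
        return address
```

A branch to 0x200 followed by a return would be issued as `next_address("external", external=0x200, push_return=True)` and later `next_address("stack")`.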


3.6.2  Interrupt  Control  Unit 

The  interrupt  control  unit  (see  Figure  21)  contains  the  priority  interrupt  unit  to  establish 
the  relative  priority  of  interrupts  as  received  and  the  control  interface  to  control  the  inter- 
rupt requests,  thereby  allowing  disruption  only  when  desirable. 


The  priority  interrupt  unit  has  the  interrupt  register  for  reception  of  the  interrupts.  The 
interrupt  register  feeds  the  interrupt  logic  to  determine  the  priority  of  the  interrupt,  as  well 
as  to  provide  the  interrupt  address  to  the  address  mux.  The  highest  priority  interrupt  may 
be considered a DMA controller which can effect a memory store via the control interface
without  interrupting  the  data  processor. 

Figure 21. Instruction Addressing Function


The  control  interface  handles  interrupt  requests  and  provides  control  to  the  priority  interrupt 
unit.  It  is  controlled  by  the  auxiliary  flags  and  the  high  priority  interrupt  line  which  is 
reserved  for  DMA  loading  from  the  system  I/O.  The  final  function  of  the  control  interface 
is to respond to the higher level processors in an array configuration. In other words, the higher
order control and response are handled by the control interface for array coordination.

3.6.3  Flag  Logic 

The  flag  logic  (see  Figure  21)  is  necessary  so  that  test  flags  may  be  used  to  control  the  next 
address  given  to  the  microinstruction  memory.  The  external  test  flags  include  the  carry  bit, 
overflow, sign and ALU equals zero received from any combination of the DP’s and DA.
Furthermore,  the  loop  counter  provides  a zero  indication  which  may  be  used  to  stop  a “DO” 
loop.  (Further  discussion  will  be  given  in  the  loop  counter  section.)  Auxiliary  flags  have 
been  included  to  extend  the  limited  input  (seven  flags)  to  the  flag  logic. 

The flag logic and auxiliary flags control the loop counter’s incrementer, the external flag test
and the interrupt control unit; however, that control can partially be modified by the condition
select codes furnished by the next instruction control word sent to the IA instruction
register from the microinstruction memory.

3.6.4  Loop  Counter 

The  loop  counter  (see  Figure  21)  provides  a simple  way  to  control  the  looping  of  repetitive 
routines,  and  it  represents  the  only  departure  from  the  very  fundamental  control  provided  in 
most basic microprogrammable processors. The loop counter receives a literal from the
external  input  which  sets  up  the  loop  count.  Each  clock  cycle,  depending  on  the  flag  test 
and  the  IA  instruction  register,  the  count  is  either  decremented  or  passed  from  the  counter 
output  to  the  counter  input  undisturbed;  therefore,  at  the  end  of  each  pass  through  a routine, 
the  beginning  instruction  of  the  routine  is  addressed  and  the  loop  count  is  decremented. 

When  the  count  is  zero  and  the  end  of  the  routine  is  reached,  the  loop  is  ended. 
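
The loop counter behavior described above can be sketched as follows. The counter is loaded with a literal, passes its value through undisturbed on most clock cycles, and is decremented at the end of each pass; the loop ends when the end of the routine is reached with a zero count. The names are hypothetical, and the sketch assumes that a literal of N therefore gives N + 1 passes, which is one reading of the text rather than a stated fact.

```python
class LoopCounter:
    """Toy model of the single loop counter in the instruction addresser."""

    def __init__(self):
        self.count = 0

    def load(self, literal):
        """Set up the loop count from the external input."""
        self.count = literal

    @property
    def zero(self):
        """Zero indication supplied to the flag logic."""
        return self.count == 0

    def clock(self, decrement):
        """One clock cycle: decrement the count or pass it through unchanged."""
        if decrement and self.count > 0:
            self.count -= 1


def run_routine(counter, routine_length):
    """Return the instruction offsets executed until the loop ends."""
    executed = []
    done = False
    while not done:
        for offset in range(routine_length):
            executed.append(offset)
            last = offset == routine_length - 1
            if last and counter.zero:
                done = True                     # zero count at end of routine
            else:
                counter.clock(decrement=last)   # decrement only on the last step
    return executed
```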

This  structure  is  quite  simple  and  could  be  replicated  any  number  of  times  to  allow  auto- 
matic control  of  nested  loops  as  in  the  FFT  algorithm,  pulse  classification  algorithm  and 
any  algorithm  that  requires  a number  of  passes  through  a fixed  routine.  In  the  current 
structure  only  one  has  been  included  because  LSI  gate  count  constrains  the  number  of  loop 
counters  that  are  advisable. 

3.7  ARRAY  PROCESSING 

Either complex processor, discussed earlier, is suitable for use as an array processing element
or controller in a parallel array multiprocessor. The rationale for array processing is simply
to have a number of computers applied to a single task, thereby multiplying the computation
power. The multiple computer systems may be divided into two classes: (1) Single Instruction
Stream/Multiple Data Stream (SIMD) systems, referred to as parallel processors, and (2) Multiple
Instruction Stream/Multiple Data Stream (MIMD) systems, called multiprocessors. Historically,
signal processing problems have been proposed for parallel processors; however, data dependent
algorithms, such as associative search and pulse classification, are extremely difficult on such machines.

Multiprocessor systems have a collection of relatively independent processors sharing a common
memory and set of I/O devices. The processors must contend for access to the memory
and I/O, which makes the multiprocessor architecture slow for signal processing tasks requiring
high I/O rates. The ability to easily share data operands is a desirable feature of the multiprocessor
systems.

By approaching the array processor problem from the point-of-view of the signal processing
problem  set,  the  parallel  processor  architecture  with  a limited  ability  to  share  and  pass  data 
between  nearest  neighbor  processors  is  highly  desirable.  Heretofore,  such  an  approach 
would  have  been  limited  by  the  sheer  bulk  of  the  array  elements;  however,  current  LSI 
technology  affords  new  potential  for  a “mixed”  approach. 

Figures 16 and 18 show that one data I/O port of each of
the real processor and the imaginary processor is removed from the data bus of the complex
processor and freed for use in data transfer to the nearest neighbor array elements.

A port is also made available for data flow from a control processor element via the broadcast
bus. Processor I requires that the broadcast bus be tied to the data buses to permit
the proper data flow during multiplier operation. Processor II is able to free up the ports
of the processors; therefore, the broadcast bus does not need to contend with
an internal array processor data bus.

3.7.1 Array Processor Element

A system of four array processor elements and a control element is shown in Figure 22 to
represent a parallel array multiprocessor. One processor acts as a controller to this system,
and each of the remaining four is configured with two 16-bit RALUs, which provide arithmetic and
logic capability for the processor. Associated with each RALU is a data memory consisting
of both PROM and RAM. Each RALU is responsible for addressing its own memory. The
RAM provides a total of 1K 32-bit words of storage for dynamic data, while the PROM
holds 512 32-bit constants used in performing the FFT algorithm.
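
The memory sizes quoted above line up with a 1024-point complex FFT: 1K 32-bit words hold 1024 complex points at 16 bits per component, and a radix-2 transform of that size uses 512 distinct rotation (twiddle) factors. The report does not say what the 512 constants actually are, so the sketch below is only one plausible reading. The Q15 scaling, the packing order (real part in the upper half word), and the function names are all assumptions.

```python
import math

N_POINTS = 1024
Q15_SCALE = (1 << 15) - 1  # assumed 16-bit fractional scaling


def to_q15(x):
    """Convert a value in [-1, 1] to a 16-bit two's-complement bit pattern."""
    return int(round(x * Q15_SCALE)) & 0xFFFF


def twiddle_table(n_points=N_POINTS):
    """Build n_points/2 packed 32-bit constants: real part high, imaginary low."""
    table = []
    for k in range(n_points // 2):
        angle = -2.0 * math.pi * k / n_points
        real = to_q15(math.cos(angle))
        imag = to_q15(math.sin(angle))
        table.append((real << 16) | imag)
    return table
```

With this packing the real half word would be read by the real RALU's memory and the imaginary half word by the imaginary RALU's memory.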

Each of the RALUs is independent of the other in that they may perform different instructions.
This allows efficient complex number arithmetic to be performed. In executing
algorithms involving complex values, real numbers are stored in one data memory and imaginary
numbers in the other. A path is provided between the RALUs to allow transfer
of data. Each of the RALUs provides one 16-bit bidirectional bus to a neighboring array
processor so that interprocessor data transfers may take place. The real RALU provides a
connection to the higher-order 16 bits of the system broadcast bus. The lower-order 16 bits
are connected via transceivers to the imaginary RALU’s memory data bus. The bus transceivers
are controlled by a one-bit field in the microinstruction memory.

The multipliers are connected in parallel and have a bidirectional port to each memory.
Their operation is alternated by the microcode which controls them. This is necessary
because they are fully independent circuits, and it is fruitless to attempt to load or empty
them simultaneously. The capabilities of the multipliers include the following: multiply,
multiply accumulate, and butterfly in both real and complex formats; double precision
scalar multiply.

The  microinstruction  memory  supplies  all  instruction  fields  to  the  processor  hardware.  The 
fact that all elements of hardware can be controlled by a single microinstruction makes the
array  processor  a horizontally  micro-coded  machine.  This  enhances  its  speed  and  makes 
each  instruction  very  powerful. 

Figure 22. Parallel Array Multiprocessor

The  processor’s  speed  is  enhanced  further  by  the  fact  that  the  RALUs  and  the  multipliers 
contain  instruction  registers  that  allow  instruction  fetch  and  execution  to  be  overlapped. 

The  microinstruction  memory  is  addressed  by  the  program  counter  which  is  located  in  the 
controller.  The  microinstruction  memory  supplies  a literal  field  to  the  controller  which  is 
generally  used  as  a branch  address.  An  alternate  branch  address  can  be  determined  from 
data received from the controller via broadcast bus. This is the mechanism by which the
array  processor  can  receive  task  assignments  from  the  control  processor.  The  controller  has 
flag  testing  logic  onboard  and  accepts  up  to  eight  flags  from  the  RALU  and  multiplier  chips. 
A total  of  12  flags  are  available  from  the  devices,  however,  so  an  FPLA  should  be  used  to 
combine  some  of  the  flags.  The  FPLA  logic  is  controlled  by  a microcode  field  from  the 
control  PROM. 

A specialized  control  interface  is  incorporated  into  the  controller.  The  control  interface  is 
connected  directly  to  the  array  control  buses  shown  in  Figure  23.  The  interface  logic  is 
illustrated  in  some  detail  in  the  IA  discussion. 

3.7.2  Parallel  Array  Multiprocessor 

The  efficiency  of  uniting  the  array  processors  to  perform  parallel  tasks  is  dependent  on  their 
ability  to  operate  synchronously.  For  this  reason,  all  processors  in  the  system  operate  from 
the  same  clock  source.  If  they  were  not  synchronized,  complex  and  time  consuming  soft- 
ware routines  would  be  required  for  intercommunication,  and  hardware  would  have  to  be 
provided  to  accomplish  handshaking. 

The broadcast bus is used both for issuing commands to the array elements and for getting
data  into  and  out  of  the  array.  Its  dual  use  is  made  practical  by  the  fact  that  task  initiation 
ties  up  the  bus  only  for  the  amount  of  time  required  to  issue  a single  program  address  to 
the  responding  array  elements  - one  clock  cycle.  Efficient  control  of  the  processors  in  the 
array  depends  upon  a mechanism  for  selectively  issuing  commands  to  the  array  elements 
and  for  determining  their  program  status.  The  control  structure  indicated  by  Figure  23 
allows  this  to  be  done. 


The  “instruction”  control  signal  identifies  whether  or  not  the  broadcast  bus  currently  con- 
tains an  instruction.  For  a processor  element  to  accept  an  instruction  from  the  bus,  it 
must  first  be  in  a state  of  attention,  either  by  having  ended  a previous  task  segment  or  by 
way  of  interrupt  from  the  controller.  The  “interrupt”  signal  is  used  by  the  control  pro- 
cessor to  issue  interrupts  to  the  array.  The  control  processor  is  able  to  determine  which 
elements  of  the  array  are  in  a state  of  attention  by  means  of  a general  purpose  flag  register 
which  resides  in  each  of  the  array  processors.  The  controller  may  simultaneously  sample 
the  flag  registers  of  the  array  elements  by  means  of  the  “response”  signal  which  is  available 
from  each  element  as  shown  in  Figure  23.  The  flag  registers  contain  a number  of  flags  and 
any  of  them  can  be  gated  to  the  response  bus  by  way  of  the  “condition  select”  lines. 

The  controller  accepts  a single  interrupt  from  the  array.  The  interrupt  line  is  daisy-chained 
throughout  the  array  elements,  and  the  assignment  of  priority  is  established  by  the  way  in 
which  the  chain  is  routed.  As  it  is  necessary  for  the  control  processor  to  determine  the 
source  of  an  interrupt,  each  array  processor’s  flag  register  includes  the  interrupt  flag. 
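
The single daisy-chained interrupt line can be modeled very simply. In the sketch below the controller sees one combined interrupt line and then samples the per-element interrupt flags to find the source. Treating the element nearest the head of the chain as the highest priority is an assumption; the report says only that priority follows the routing of the chain. The function names are hypothetical.

```python
def interrupt_line(interrupt_flags):
    """State of the single interrupt line seen by the controller."""
    return any(interrupt_flags)


def highest_priority_source(interrupt_flags):
    """Index of the first requesting element along the chain, or None.

    interrupt_flags is ordered from the head of the daisy chain, which this
    sketch assumes to be the highest-priority position.
    """
    for position, pending in enumerate(interrupt_flags):
        if pending:
            return position
    return None
```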

The response logic described operates only when the controller is not issuing a command to
the array (i.e., when the “instruction” signal is not asserted), so the controller cannot simultaneously
examine flags and issue commands; as a practical matter, this is not a handicap. The
reason for this is that the flag inspection logic has a dual use. Both instructions and interrupts
to the array can be made conditional, so that it is possible to selectively apply them to
the array. The response logic is instrumental to this purpose. The “condition select” lines
control the condition by which each array processor determines whether or not an instruction
or interrupt is intended for it.

One  of  the  condition  codes  corresponds  to  “unconditional”,  that  is,  it  specifies  every  element 
of the array. This is used when the entire array is to perform a parallel task. All but one
of the remaining condition select codes each specify one of the flags in the array processors’
flag  registers;  the  “true/false”  signal  establishes  whether  the  specified  condition  is  the  true  or 
false  state  of  the  flag.  It  is  thus  possible  to  selectively  issue  commands  to  elements  which, 
through  previous  program  tasks,  have  set  flags.  The  remaining  condition  code  allows  the 
controller  to  use  the  response  bus  to  specify  which  array  elements  are  selected;  for  this 
reason  the  response  bus  is  bidirectional.  The  controller  may  then  pick  the  responding 
elements  by  asserting  the  response  line  to  which  each  is  tied. 
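
The selection rules described above can be summarized in a short sketch. Each array element decides for itself whether a broadcast instruction or interrupt applies to it: always for the unconditional code, by comparing one of its flags with the true/false signal for a flag code, or by looking at its own response line for the remaining code. The code names, the flag names, and the dictionary representation of the flag register are assumptions made for illustration.

```python
UNCONDITIONAL = "unconditional"   # hypothetical name for the select-all code
RESPONSE_BUS = "response_bus"     # hypothetical name for the select-by-line code


def element_selected(condition, true_false, flags, response_line=False):
    """Decide whether one array element accepts the broadcast command.

    flags maps flag names to booleans and stands in for the element's
    general purpose flag register.
    """
    if condition == UNCONDITIONAL:
        return True
    if condition == RESPONSE_BUS:
        return response_line
    return flags[condition] == true_false


def select_elements(elements, condition, true_false=True, response_lines=None):
    """Return the indices of all elements that accept the command."""
    if response_lines is None:
        response_lines = [False] * len(elements)
    return [
        index
        for index, flags in enumerate(elements)
        if element_selected(condition, true_false, flags, response_lines[index])
    ]
```

For example, issuing a command on the false state of an "attention" flag would reach only those elements still busy with an earlier task.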

The  control  mechanisms  described  are  extremely  flexible  and  account  for  the  ability  to 
efficiently use the system in both parallel processor and multiprocessor modes.

SECTION IV


LSI  TECHNOLOGY  SUMMARY 


4.0  INTRODUCTION 

The status of LSI development is an ever-changing scene. For a time, a given technology or
several technologies will reign supreme in the marketplace, only to give way to new technologies
or improved versions of the older technologies. This point notwithstanding, an
attempt  must  be  made  to  gather  enough  information  about  the  state-of-the-art  to  determine 
whether  a particular  function  is  feasible  as  an  LSI  chip  or  must  be  made  with  a limited  num- 
ber of  chips  and  chip  types. 

A survey of the technology has been made to get a rough picture of the present status of
LSI technologies. From this survey, an attempt has been made to extract a list of
macroconstraints which an LSI function cannot exceed today or in the next one to two years.

Included  as  the  final  section  of  this  chapter  are  some  methods  and  methodology  for  LSI 
development. 

4.1  TECHNOLOGY  SURVEY 

This technology survey gives the present status of the technologies available for both custom
LSI and memories. The current research in LSI technologies is to satisfy demands for
greater function in the microprocessor area (custom LSI) and higher density and greater speed
in all types of memories. The developments are related to economics: increased density,
lower speed-power factors, larger wafers, and improved yield. The discussion will be separated
into a section on custom LSI and a section on LSI/VLSI memories.

4.1.1  Custom  LSI 

The major characteristics of the current technologies that are available for custom LSI/VLSI
applications are included in a number of brief descriptions and are summarized in Table 6.

Table  6 is  an  attempt  to  take  the  sometimes  ambiguous  data  for  the  various  technologies  and 
alter  it  to  some  standard  definition  or  measurement  procedure;  therefore,  a description  of  the 
Table  is  included  at  the  end  of  the  technology  discussions. 

4.1.1.1 SiGate NMOS

The N-channel MOS uses ion-implantation, SiGate, and doped-oxide technology, with a
100 crystal orientation process. The N-channel device, with its higher electron mobility and low
threshold voltage, offers faster operation while using less power. Higher substrate doping
allows the channel to be shorter, resulting in reduced input capacitance and reduced size. With
its low power, high mobility, and packing density, NMOS is compatible with, and even desirable
for, custom LSI technology.

4.1.1.2 N-Channel Depletion-Enhancement Mode SOS-MOSFET

NMOS/SOS evolved out of the conventional bulk SiGate NMOS approach; it is
fabricated on an insulating sapphire substrate. Its advantage over bulk NMOS comes from
virtually eliminating the parasitic capacitance and from increasing surface carrier
mobility, which gives maximum current for a given geometry.

4.1.1.3 VMOS


The N-channel V-MOS transistor is formed along the slope of a V-groove which is
anisotropically etched into the surface of a silicon wafer. The process uses a double-diffusion
profile in the channel region under the gate, which effectively reduces the channel region to
a micrometer. Compared with NMOS, VMOS technology saves about 40% in random logic
area and has a lower speed-power product. This advantage makes VMOS attractive for use in a
broad range of memory devices.

4.1.1.4 DMOS


The planar double-diffused MOS exhibits short-channel characteristics which are obtained from
a full-size device. The channel length is determined by the difference in lateral diffusion of
two profiles. Effective channel lengths of less than 1 µm can be obtained independent of the
photolithographic tolerances which limit channel length for conventional MOS fabrication. It
appears that its performance advantage over a conventionally scaled-down device may be too
small to make it worth considering at this time.

4.1.1.5 C2L/MOS

C2L is a self-aligned silicon-gate CMOS technology where the gate completely surrounds
the drain, providing a transistor aspect ratio which maximizes the ratio of transconductance
to capacitance, thus allowing high speed on-chip. The C2L device exhibits a factor of 3
improvement in packing density over standard CMOS and operates at a frequency approximately
4 times faster than standard CMOS. The C2L device requires 6 photomasks, one less than
standard CMOS. In regard to custom C2L LSI design, the only known source is not
interested unless the volume is high (million units per year).


4.1.1.6 CMOS/SOS


The SOS/MOS technology evolved out of the conventional bulk-silicon approach. The
silicon-on-sapphire (SOS) approach comes closest to the desirable features of high-speed
performance at low supply voltages and with nanowatts of stand-by power dissipation. The
MOS/SOS devices can be fabricated in a thin single-crystal silicon film grown on the insulating
sapphire substrate. The use of thin-film silicon virtually eliminates the parasitic capacitance,
which gives the highest speed with minimum power and circuit complexity. In addition,
its non-junction type isolation improves its transient radiation resistance
characteristics. Availability has been a consistent problem for this technology, however.

4.1.1.7 First and Second Generation I2L/MTL

First generation integrated injection logic (I2L), or merged transistor logic (MTL), is basically
derived from direct-coupled transistor logic and utilizes a basic four-mask, double-diffused
bipolar process without junction isolation. The second generation I2L/MTL gate is fabricated with
a new process/structure, which includes a matrix on the P+ extrinsic base drive and implanted
intrinsic base dose for the n-p-n transistor. While retaining the advantages of the first
generation, it is designed to operate at greater speed with the same injector current. I2L promises
to play an important role in LSI technology.

4.1.1.8 S2L

S2L has the structure, topology, and characterization of integrated injection logic with a
self-aligned double-diffused injector. The new structure, a lateral p-n-p transistor with
effective submicron base width, can be realized by using standard photolithographic techniques.
S2L, with its higher injector efficiency and low parasitic capacitance, offers a large
fan-out capability, high speed, and a large noise margin. The packing densities are improved by
a factor of 2 over standard I2L logic.

4.1.1.9 SFL


Substrate fed logic uses an approach designed primarily for LSI where high packing
density and low power-delay are desired. The basic logic element is a multi-input, multi-output
gate, formed in a single base area by using several diffused collectors and several Schottky
base contacts. It has been found that an overall improvement of 2.2 in packing density
of SFL over I2L technologies with the same tolerances can be obtained. It was noticed that,
at maximum speed, SFL power dissipation is equal to that of standard I2L logic.

4.1.1.10 Schottky I2L


Schottky I2L is a modified form of the substrate fed logic, differing from the earlier process
in the extrinsic n-p-n base profile. Heavier boron doping in this region has led to less charge
storage so that minimum delay and power are reduced. The high performance of Schottky
I2L has been achieved with a structure designed for high yield by use of simple processing
techniques.

4.1.1.11 Up-Diffused I2L

The “up-diffused” structure is fabricated in such a fashion that Schottky diodes can be readily
incorporated. With the addition of Schottky clamps between the collector and base of the
n-p-n switching transistor, an improvement in gate delay by a factor of 5 and in power-delay
product by a factor of 2 is achieved over standard I2L. Another version is injected Schottky
logic (ISL), currently under development by Signetics.

4.1.1.12  I3L 

The Isoplanar integrated injection logic (I3L) technology emphasizes achieving high packing
density and high performance by the use of various process innovations combined with
topological design variations. High performance has been achieved without the use of Schottky
clamping, and the process is equivalent in complexity to any standard dual-layer metal
bipolar technology. By using a two-level metal scheme, the packing density of I3L is equal to
that of NMOS technology.

4.1.1.13 Table 6 Description


Table 6 lists the bipolar and MOS technologies that are currently available or in development
for custom LSI/VLSI applications. Table 6 was generated from the data received directly
from various semiconductor producers, from the literature search, and from personal direct
inquiries. The data specifications supplied by semiconductor producers or journal reports are
sometimes ambiguous and referenced to non-standard values. Therefore, data specifications
were altered to a given standard value for ease of comparison.


Table 6 contains 6 parameters which are most important for this technology survey study.
They are as follows:

Gate  Delay:  For  bipolar  technology,  a maximum  intrinsic  delay  for  a one  and  five-collector 
gate  was  listed.  For  MOS  technology,  a maximum  intrinsic  delay  for  fan  out  one  and  three 
was  listed,  at  5 volts  power  supply. 

Power  Dissipation  Per  Gate:  It  is  static  and  dynamic  power  dissipation  at  nominal  maximum 
frequency  with  +5  volts  power  supply.  The  nominal  maximum  frequency  is  defined  as  the 
average  of  maximum  repetition  rate  at  single  and  multiple  load  conditions. 

Speed-Power  Product:  It  is  a product  of  gate  delay  times  power  dissipation  per  gate. 

Gate  Area  per  Square  Millimeter:  It  is  a random  logic  area  with  approximately  50%  area 
assigned  to  interconnect  and  power  busing. 


Repetition  Rate:  It  is  a range  of  frequency  of  operation  where  the  lower  and  upper  end  of 
the  range  is  a function  of  the  fan-out  load. 

All of the circuit technologies listed in Table 6 are referenced to 5-7 µm mask rules.
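
As a small worked check of the speed-power product definition above, the sketch below multiplies gate delay in nanoseconds by power per gate in milliwatts, which gives picojoules directly. The function names are illustrative assumptions, and pairing the low ends and the high ends of each range is a simplification that happens to reproduce the rows used here; it is not stated in the report as the method used for every row of Table 6.

```python
def speed_power_product_pj(gate_delay_ns: float, power_mw: float) -> float:
    """Gate delay times power per gate; ns * mW = 1e-12 J, i.e. picojoules."""
    return gate_delay_ns * power_mw


def product_range(delay_ns, power_mw):
    """Pair the low ends and the high ends of two (low, high) ranges."""
    return (
        speed_power_product_pj(delay_ns[0], power_mw[0]),
        speed_power_product_pj(delay_ns[1], power_mw[1]),
    )


# Rows taken from Table 6: (gate delay range in ns, power range in mW/gate).
i3l = product_range((4, 5), (5, 5))
schottky_i2l = product_range((8, 8), (2, 3))
up_diffused_i2l = product_range((2.5, 3.5), (5, 5))
```

For these three rows the results agree with the speed-power product column of Table 6 (20-25, 16-24, and 12.5-17.5 pJ).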

                           System Parameters
Circuit                    Gate Delay  Power/Gate  Speed-Power   Random Logic  Number    Repetition
Technology                 (ns)        (mW/gate)   Product (pJ)  Gate Area     of Masks  Rate (MHz)
                                                                 (gates/mm2)

SiGate NMOS                8-30        0.4-0.5     4-12          100-150       6-7       8-30
DMOS                       6-20        1.4-1.6     15-20         225           6         13-40
C2L/MOS                    6-40        1.25-2.5    15-50         270           6         6-40
NMOS/SOS                   2.7-9       2.6-3       8.1-23.4      100-150       7         28-90
VMOS                       5-20        0.8-1       5-16          80-300        7         13-50
CMOS/SOS                   3-20        1-2.5       7.5-20        150-250       7         11-80
First Generation I2L/MTL   25          0.5         12.5          80-160        4         10
2nd Generation I2L/MTL     4-8         5           20-40         60-120        6         30-60
S2L                        10          0.4-5       4-50          170           5         25
SFL                        20-30       0.5         12.5          120-240       -         10
Schottky I2L               8           2-3         16-24         400           6         20
  (Modified SFL)
Up-Diffused I2L            2.5-3.5     5           12.5-17.5     100           6         70-100
ISL                        2-5         3-7.5       15            100           6         50-125
I3L                        4-5         5           20-25         250-300       6         50-62


4.1.2  LSI/VLSI  Memories


There are several new and old LSI technologies that are competing for the new generation of
memories in the range of 64 kilobits. Table 7 lists current memory devices and their performance.
Charge-coupled device (CCD) memories at the 65 kilobit level, for block storage applications, are
serially accessible and are slower and more difficult to use than RAMs. The only reason to use CCD
is its price advantage, on the order of two to one over RAMs. A VMOS device that has large
potential density and low power consumption is available as a 64K read-only memory on a 175
mil square chip. Another competitive technology is HMOS, using a scaled-down 2-µm rule and
offering a high density, lower power MOS RAM.

In the future, one or two years away, VLSI memories with 256K bit capabilities will emerge from
production lines. One of the problems in VLSI is the interconnection on the chip. This
problem may be reduced by the use of double-poly or three-layer metal interconnection in
conjunction with innovative logic.

Table 7.  Memory Device and Performance

Device Type    Density/Capability  Speed     Process  Manufacturer
               (bits)              (ns)

Dynamic RAMs   16K                 150-300   N-MOS    Fairchild
               16K                 100       I3L      NTT
               65K                 150-300   N-MOS    AMI
               65K                 150       V-MOS    AMI

Static RAMs    4K                  150       N-MOS    -
               4K                  50        H-MOS    Intel
               4K                  55        V-MOS    AMI
               8K                  150       N-MOS    Mostek

ROMs           64K                 80        V-MOS    Mostek
               64K                 250       V-MOS    AMI
               64K                 300       H-MOS    Intel

CCD            64K                 200-500   -        Fairchild


4.1.3  Figure  of  Merit 


The complexity of the MOSFET and bipolar technologies over the past several years has
created the hard task of standardizing sensitive parameters. Those parameters are used for
comparison in the LSI technology survey. One of the very important parameters is the
power-delay product, which indicates how much power is necessary for a given gate to operate at
its maximum frequency at a given supply voltage and fan-out condition. For example, the
power-delay product of I2L exhibits a linear relationship over the extrinsic region (slow gate
delay), and its power dissipation is computed using injector current times collector voltage
swing (less than 1 volt). Obviously, the power-delay product parameter so obtained will be low
and impressive. In order to be comparable with the rest of the LSI technologies in this survey,
an intrinsic gate delay, which is a delay due to the minority carrier charge-storage effect, and
a 5 volt supply voltage were used to determine the power-delay product. As for CMOS technology,
where only the static power dissipation factor is often used to generate a low power-delay
product, a given nominal maximum frequency is included in the power-delay product. Therefore,
I2L and MOSFET technologies can be easily evaluated and compared.


These  factors  are  not  the  only  ones  that  could  or  should  be  included  to  determine  the 
relative  merits  of  the  technologies  in  Table  6.  A number  of  system  considerations  should  be 
included  before  a technology  is  chosen  for  a given  application.  A fuller  discussion  of  this 
point  is  included  in  Section  4.3.  However,  a figure  of  merit  will  be  defined  using  the  speed- 
power  product,  gates  per  unit  area,  and  maximum  frequency  as  defined  for  Table  6.  Since 
lower speed-power products are preferable, this factor will go in the denominator. Higher
gate  densities  and  maximum  frequencies  of  operation  are  more  desirable;  therefore,  these 
factors  will  be  in  the  numerator. 

In an attempt to rate these technologies for custom LSI, a very simplistic approach was
taken: utilize the factors from Table 6 to create a figure of merit (FOM). Two factors are
immediately discernible as significant from an LSI point of view: speed-power product and
gates per unit area.

Speed-power  product  has  long  been  used  as  a measure  of  the  “goodness”  of  a technology.  It 
is  used  to  measure  technologies  against  constant  speed-power  lines  on  the  now-famous  gate- 
delay,  gate-power  chart.  On  that  chart,  the  lower  a technology’s  speed-power  product,  the 
better  the  technology  is  considered. 

To evaluate LSI potential, a second factor must be added to the evaluation: gates per unit
area. High gate packing density is crucial if a technology has any hope of approaching
LSI/VLSI potential, because the integrated circuits will be smaller, thereby lowering the
probability of failure due to surface defects on the wafer. From a system application point of
view, if a lower-speed technology can provide sufficient parallelism of operation and can fit
[...] than a higher-speed technology, it may be more advantageous to go
[...]. The decision will be partially based on the true maximum speed
[...]. If the parallelism is too high or the safety margin in the performance
[...], a faster technology may be chosen. Thus, the maximum
frequency should be included in any FOM.


FIGURE  OF  MERIT 


Assuming the approach taken for Table 6 has provided some standardization in these factors,
then the figure of merit will be of the form:

$$\mathrm{FOM} = \frac{\text{Nominal Gates per Unit Area}}{\text{Nominal Speed-Power Product}} \times \text{Nominal Maximum Frequency}$$



Utilizing  Table  6,  Figure  24  was  derived  according  to  the  FOM  operation. 

Figure of merit is plotted against the different available technologies. No clear-cut leading
technology in existence is indicated, but several I2L and MOSFET technologies have the potential
to be leading LSI/VLSI technologies. They are I3L, ISL/I2L, CMOS/SOS, and VMOS.
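
The FOM calculation can be sketched as follows for a subset of the Table 6 rows. The report does not say how a "nominal" value was taken from each range, so this sketch assumes the midpoint of the range for all three factors; the names `TABLE6`, `midpoint`, and `figure_of_merit` are illustrative, and the results are not claimed to reproduce Figure 24.

```python
def midpoint(low, high):
    """Assumed 'nominal' value: the midpoint of a (low, high) range."""
    return (low + high) / 2.0


# Subset of Table 6: (gates/mm^2, speed-power product in pJ, repetition rate in MHz).
TABLE6 = {
    "I3L": ((250, 300), (20, 25), (50, 62)),
    "CMOS/SOS": ((150, 250), (7.5, 20), (11, 80)),
    "ISL": ((100, 100), (15, 15), (50, 125)),
    "VMOS": ((80, 300), (5, 16), (13, 50)),
    "Up-diffused I2L": ((100, 100), (12.5, 17.5), (70, 100)),
    "NMOS/SOS": ((100, 150), (8.1, 23.4), (28, 90)),
    "Schottky I2L": ((400, 400), (16, 24), (20, 20)),
    "SiGate NMOS": ((100, 150), (4, 12), (8, 30)),
    "First generation I2L/MTL": ((80, 160), (12.5, 12.5), (10, 10)),
}


def figure_of_merit(gates, speed_power, frequency):
    """FOM = (gates per unit area / speed-power product) * maximum frequency."""
    return midpoint(*gates) / midpoint(*speed_power) * midpoint(*frequency)


fom = {name: figure_of_merit(*row) for name, row in TABLE6.items()}
ranking = sorted(fom, key=fom.get, reverse=True)
for name in ranking:
    print(f"{name:26s} {fom[name]:7.1f}")
```

Under the midpoint assumption, the five highest-ranked entries in this subset are I3L, CMOS/SOS, ISL, VMOS, and Up-diffused I2L, with no large gap among them. The ordering within that group is sensitive to how the nominal values are chosen.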

The I2L high-speed technologies, relatively new and still developing processes, have some good
and bad points. Advantages include process simplicity, packing density, high-current capacity,
Schottky diode contacts, low speed-power product, linear components mixed with digital
components, and a very large scale integration oriented process. Bad points are low voltage swing
(less than 1 volt), low noise margin, difficult device modeling, additional interface circuitry
(required due to the low operating voltage), gamma sensitivity (10" rad Si degrades
power-delay characteristics), and multilayer metal interconnections. Of course, many of those
problems are being reviewed and resolved by the emerging new technology concepts.

The MOSFET technology is a more mature process which also has some good and bad points.
Advantages are high packing density, low speed-power product, relative simplicity,
straightforward device modeling, high yield, good radiation hardening, circuit interface with T2L
logic, and high noise immunity. Bad points are: speed limitation due to high voltage swing,
threshold voltage, low interface drive, and large area interface drivers.


The future of LSI/VLSI technology lies in the development of submicron technology,
innovation in logic circuit design, and multilevel interconnection. A semiconductor device having
masked dimensions of less than one micron will no longer be fabricated with the use of
standard photolithographic techniques. The technology trends will be based on the use of
e-beam and x-rays to pattern the surface of the semiconductor wafer. Submicron technology
will benefit both bipolar injection logic and MOS devices. It seems that scaled-down
technologies are able to give very large scale integrated circuits (VLSI) with a speed-power
product in the 0.2 to 1 pJ range and delay times of 0.5 to 1 nanosecond.

4.2  MACROCONSTRAINTS 


Five technologies appear to be candidates with good LSI potential. They are:

1.  I3L

2.  CMOS/SOS

3.  ISL

4.  VMOS

5.  Up-diffused I2L

Exactly how good is each of the technologies?

It  is  necessary  to  determine  the  various  constraints  that  each  technology  forces  on  LSI 
developments.  These  constraints  grow  from  practical  limitations  of  the  LSI  process  to  be 
used. In essence, one must assess the ground rules of each technology in the areas of:

a.  I/O  Pin  limits 

b.  Power  dissipation  limits


c.  Level  of  integration  limits 

d.  On-chip/off-chip  gate  delays 

e.  Interface  compatibility 

f.  Maximum  chip  sizes. 

Without becoming tutorial, each of the above ground rule areas is a simple reflection of a
given technology's ability to handle a function with LSI.

Virtually  without  regard  to  technology,  the  maximum  practical  package  size  for  dual-in-line 
packages  seems  to  be  64  pins.  Larger  sizes  have  not  become  popular.  Leadless  packs  may 
increase  this  number  to  80  or  more  pins;  however,  power  dissipation  must  be  considered 
when  dealing  with  leadless  packages. 

Interface  compatibility  is  almost  always  assumed  to  be  TTL  voltage  and  drive  levels  at  the 
interface.  Since  the  bulk  of  presently  available  commercial  circuits  have  TTL  compatibility,  it 
remains  a good  ground  rule  that  TTL  levels  be  maintained  for  interface  compatibility.  This 
ground  rule  presents  some  problems  for  the  MOS  technologies  which  operate  over  a wider 
range of voltage levels and do not provide as much sink capability with normal output buffers.
CMOS/SOS  and  VMOS  are  each  capable  of  meeting  the  voltage  levels  with  no  difficulty  since 
each  technology  is  now  powered  by  5 volts;  however,  the  drive  levels  require  much  larger 
output  drivers  which  increases  surface  area  of  the  chip.  The  bipolar  technologies  require  some 
modification  of  the  output  devices  from  their  basic  on-chip  devices,  but  the  difference  is 
small. 

Maximum  chip  size  is  dependent  upon  the  surface  defect  density  of  the  LSI  process.  Chip 
size, therefore, has a direct bearing on the yield of the process. Each technology has
different tradeoff points where the chip size/yield curve becomes unprofitable. However, vendors
are  more  comfortable  in  considering  larger  chips  with  their  improved  processing  capability. 

A reasonable  chip  size  limit  is  approximately  250  mils  on  a side,  although  the  average  size 
for  LSI  is  approximately  170  to  200  mils  on  a side. 

The limits of power dissipation, level of integration, and gate delay are the areas where the
technologies differ significantly. Using the data accumulated for Table 6, each of the five
technologies (I3L, CMOS/SOS, ISL, VMOS, and Up-Diffused I2L) is capable of phenomenal
levels of integration. The actually obtainable level of integration is lower than the value predicted
from Table 6 data because high functionality forces high on-chip interconnect and a large
number of bonding pads. Table 8, therefore, has reduced the maximum values of gates by
60% to account for the interconnects and bonding pads. From the gate count, power
dissipation levels were estimated, using the power dissipation extremes from Table 6.
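
The gate-count estimate described above can be sketched as chip area times the Table 6 gate density, derated to 40% of the maximum. The function name `usable_gates` and its parameter names are assumptions made for illustration; Table 8 appears to round the results, so the sketch is only expected to land within about 1% of the tabulated values.

```python
MM_PER_MIL = 0.0254  # 1 mil = 0.001 inch = 0.0254 mm


def usable_gates(gates_per_mm2, side_mils, usable_fraction=0.40):
    """Estimate gate count for a square chip after the 60% interconnect/pad reduction."""
    side_mm = side_mils * MM_PER_MIL
    return gates_per_mm2 * side_mm ** 2 * usable_fraction


# I3L density range from Table 6 (250-300 gates/mm^2) on a 100 x 100 mil chip.
i3l_small = (usable_gates(250, 100), usable_gates(300, 100))
# CMOS/SOS density range (150-250 gates/mm^2) on a 200 x 200 mil chip.
cmos_sos_large = (usable_gates(150, 200), usable_gates(250, 200))
# Chip power in watts for 780 gates at 5 mW per gate.
i3l_power_w = 780 * 5 / 1000.0
```

For the I3L case this gives roughly 645 to 774 gates, against 650 to 780 in Table 8, and 780 gates at 5 mW per gate gives the 3.9 W upper power figure listed there.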

In general, all the technologies are able to exceed 1000 to 1500 gates, with CMOS/SOS, I3L, and
VMOS easily passing the 2000 to 3000 gate range. Power dissipation becomes the limiting factor on
all of the technologies. A maximum for power dissipation should be 2 Watts or less. Although
dissipation of greater than 2 Watts can be handled with special packaging or cooling, the
overall cost is generally prohibitive.

Thus, assuming this 2 Watt power dissipation limit for custom LSI, the practical levels of
integration for the various technologies are as follows:

a.  I3L  400  to  2000  gates

b.  CMOS/SOS  800  to  2000  gates


Table 8.  Comparison of LSI Candidates

Technology        Parameter     Chip Size            Chip Size
                                100 x 100 Mils       200 x 200 Mils

I3L               Gates*        650 to 780           2575 to 3100
                  Power**       650 mW to 3.9 W      2.6 W to 15.5 W
                  Speed (Max)   4 ns                 4 ns

CMOS/SOS          Gates*        390 to 650           1550 to 2575
                  Power**       390 mW to 1.63 W     1.55 W to 6.4 W
                  Speed (Max)   3 ns                 3 ns

ISL               Gates*        260                  1040
                  Power**       780 mW to 1.9 W      3 to 7.5 W
                  Speed (Max)   2 ns                 2 ns

VMOS              Gates*        210 to 780           830 to 3100
                  Power**       170 mW to 780 mW     660 mW to 3.1 W
                  Speed (Max)   5 ns                 5 ns

Up-diffused I2L   Gates*        260                  1040
                  Power         1.3 W                5.2 W
                  Speed (Max)   5 ns                 5 ns

*40% of maximum gate count indicated by Table 6 for each chip size; assumes high degree
of inter-gate connections and bonding pads.

**Depends on percentage of high-speed, high power gates.

c.  ISL  270  to  700  gates

d.  VMOS  2000  to  2500  gates

e.  Up-diffused  I2L  400  gates
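
The practical integration levels listed above follow from dividing the 2 Watt limit by the power dissipated per gate. The sketch below uses the per-gate power extremes from Table 6; the function name `gates_within_power_limit` is an assumption made for illustration, and the listed figures appear to be rounded from these quotients.

```python
def gates_within_power_limit(power_per_gate_mw, limit_w=2.0):
    """Number of gates that fit within a chip power limit at a given power per gate."""
    return limit_w * 1000.0 / power_per_gate_mw


# Per-gate power extremes in mW from Table 6: (low, high).
POWER_MW = {
    "CMOS/SOS": (1.0, 2.5),
    "ISL": (3.0, 7.5),
    "VMOS": (0.8, 1.0),
    "Up-diffused I2L": (5.0, 5.0),
}

# The high power extreme gives the low gate count, and vice versa.
limits = {
    name: (gates_within_power_limit(high), gates_within_power_limit(low))
    for name, (low, high) in POWER_MW.items()
}
```

This gives 800 to 2000 gates for CMOS/SOS, about 267 to 667 for ISL, 2000 to 2500 for VMOS, and 400 for Up-diffused I2L, in line with the list.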

The MOS technologies are definitely LSI candidates, and the bipolar can be if the lower speed
functions are integrated in the very low power I3L and the high speed functions use the
faster I2L variations, i.e., ISL and Up-diffused I2L.

The final area of limitation is gate delay, both on-chip and off-chip. In Table 8, all the
technologies are capable of high on-chip maximum speeds; however, not shown on that
Table is the fact that the off-chip delays for the bipolar technologies are 20% to 40% more
than the on-chip, i.e., 3 to 7 nsec, and the off-chip delays for the MOS technologies are
more than 100% greater than the on-chip delays, i.e., 6 to 10 nsec. This off-chip gate delay
may be critical in some system applications.


[...] limits for general LSI development:

a.  I/O pin limit - 64 for DIP, [...]

b.  Interface compatibility - TTL voltage and drive levels

c.  Maximum chip size - 250 mils on a side

d.  Level of integration limit - 2000 gates

e.  Power dissipation limit - 2 watts

f.  On-chip gate delay - 2 to 5 nsec

g.  Off-chip gate delay - 5 to 10 nsec

4.3  THE  TECHNOLOGY  DECISION

For LSI to be effective in helping military systems perform their mission, the LSI must be
chosen by balancing the system needs with the technological abilities of the LSI. For a
system design approach to accomplish this balancing act, new methods are needed for
analytically exploring design tradeoffs in the context of the multitude of LSI technological changes.
This section will endeavor to discuss a methodology for selecting LSI from system needs. It
should be noted, before any discussion begins, that every system requirement forces choices
in technology which affect every other system requirement. Rather than capitulating to the
seemingly insoluble problem of system requirement interdependence, it is hoped that the first
order effects of technology on system requirements can be isolated so that the interdependence
is manageable in our minds.

In the following section, system requirement categories will be presented along with the
technology parameters that directly relate to the system requirement category as first order
effects.

4.3.1  System  Requirement  Categories 

4.3.1.1  Architecture

The architectural design of a system is to accomplish the system’s mission with the
technological tools available to the designer. The system architectural design is a trade-off
process of allocating system functions between the hardware tools, the system programs
(software), and the firmware used in microprogram subroutines. The overall system complexity
can be reduced by selecting the proper hardware-firmware-software balance in the
system.


From  an  LSI  point-of-view,  level  of  integration,  gate  delay,  chip  I/O,  and  testing  have  the 
most  direct  effect  on  the  architecture. 

4.3.1.2  Environment

The  system  is  designed  and  required  to  be  operational  under  various  environmental  conditions 
such  as  extreme  temperature  variations,  humidity,  vibration,  shock,  electromagnetic  or  nuclear 
radiation,  high  or  low  atmospheric  pressure,  etc.  The  chip  packaging,  the  temperature  range  of 
the  technology,  radiation  hardening  limits  and  noise  immunity  may  be  used  to  decide  if  a 
technology  can  meet  the  environmental  conditions  it  must  operate  in. 

4.3.1.3  Physical  Characteristics

The  physical  characteristics  of  a system  are  its  weight,  volume,  power  consumption  and 
cooling  requirements.  Higher  system  speed  generally  reduces  the  physical  dimensions  of  the 
circuit  and  packaging;  however,  these  reductions  result  in  greater  heat  generation  and  power 
dissipation  necessitating  improved  cooling. 

Parameters such as chip packaging, I/O, power supplies, and total chip power impact the
system volume and weight. Power is impacted by the level of integration, gate dissipation,
off-chip drives, and the number of power supplies. Speed is impacted by the system architecture,
level of integration, gate delay, number of I/O, off-chip gate delay, etc.

Both  the  system  enclosure  and  internal  module  designs  are  influenced  by  a number  of  system 
requirements.  The  enclosure  is  the  buffer  between  the  system  and  its  operational  environment, 
as  well  as  supplying  the  cooling  capability  for  the  system.  All  the  factors  affecting  the 
Environment  and  Physical  Characteristic  system  requirements  categories  impact  the  system 
packaging. 

4.3.1.4  Viability - Reliability, Availability, Maintainability and Survivability

A failure-tolerant system is designed to remain operational at some minimal performance level
despite almost any malfunction. At the system level, the impact is redundancy in components,
or at the subsystem level, the capability of diagnosing a malfunction and reconfiguring around
the system fault. Reliability, availability, and maintainability of the system are directly affected by
component and packaging technologies, circuit and subsystem design philosophy, and system
architecture. Reliability is the probability that a system will perform its function for the
duration intended. Availability is the system's capability to be in operational condition
whenever needed. Desirable system maintainability means replacing faulty modules without
significantly disrupting system activity and keeping down time to a minimum.

Survivability of the system is the protection of system hardware against nuclear effects: gamma
and x-rays, neutron fluence, and electromagnetic pulse.

The viability of a system is related to virtually all the technology parameters previously
mentioned. The power dissipation (i.e., the level of integration and gate dissipation) and chip
packaging are measures of the temperature the components will experience, which is
compounded by the environmental extremes. Reliability is greatly affected by the temperature of
the components.

Maintainability  is  related  to  testability  of  the  components,  packaging  and  I/O  pins.  Availability 
is  a measure  of  reliability  and  maintainability,  that  is,  operational  time  to  down-time. 
Survivability  is  related  to  noise  immunity,  input/output  protection  devices  on  the  chip, 
radiation  hardening,  etc. 

4.3.1.5  Cost

Systems are characterized by special environmental hardness and survivability requirements and
by size, weight, and power constraints, which require special design, manufacturing techniques
and quality control. Consequently, the system cost is affected by these requirements. With
advances in architecture, in LSI component manufacturing techniques, and in design automation,
the cost will remain relatively constant while the system performance increases.

4.3.2  Forced-Pair Comparison*


The  preceding  section  summarized,  quite  briefly,  several  system  requirement  categories  which 
may  be  seen  in  the  request  for  proposal  for  any  major  military  system.  The  categories  are  an 
attempt  to  reflect  the  mission  goals  for  the  specific  system.  When  the  system  designer  begins 
the  design  of  the  system,  he  must  prioritize  the  system  requirement  categories  for  the  whole 
system,  and  often,  for  many  subsystems  and  functions.  After  the  requirements  are  prioritized, 
it may be seen that various functions could best be implemented by LSI, and more specifically,
by new custom LSI. The LSI designer must now ascertain from the system designer
what  the  system  requirement  priorities  are  before  an  intelligent  decision  can  be  made  on  the 
LSI  technology  to  be  used. 

To aid the system and LSI designer, an empirical and somewhat subjective methodology has
been developed to prioritize the system requirements. The methodology is called the forced-pair
comparison. Using this method is fairly simple, and often the results are startling to the
designer. He will not realize his major limitations or requirements until he actually uses a
forced comparison of every category against all remaining categories. “Forced” means that a
decision about that category in relation to the next category must be made.

The  Method 


The  system  requirement  categories  are  enumerated  by  the  system  designer,  such  as 


1.  Architecture

2.  Environment 

3.  Volume 

4.  Weight 

5.  Power 

6.  System  Speed 

7.  System  Packaging 

8.  Maintainability 

9.  Reliability 

10.  Survivability 

11.  Acquisition Cost

12.  Logistic Support Cost


The  number  assigned  to  the  category  is  used  as  an  identifier  at  this  point. 

The system designer then prepares a Forced-Pair Comparison chart, as shown in Figure 25,
which has the system requirement category numbers from above along the left vertical and top
horizontal axes. The comparison procedure can now begin, working from top to bottom, row
by row. Category 1 is compared with category 2 for relative importance. If 2 is more
important than 1, a zero is placed in row 1, column 2 (1, 2) and a one is placed in (2, 1), as
in Figure 26. Category 1 is then compared with category 3, 4, etc., until all the categories
have been compared with all the other categories. Then the number of ones is counted across
each row and entered to the right of the row. Category 1 has 6 ones, Category 2 has 7 ones,
etc. From the count of total ones, a ranking can be arranged. In Figure 26, Categories 2, 3, 5,
7 and 11 are equally ranked. At this point, the procedure can be iterated to break the tie
for equal ranking.
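
The charting procedure can be sketched in a few lines of Python. This is only an illustration of the method as described: the function name, the `prefer` callback that stands in for the designer's judgment, and the four-category example are assumptions, not part of the study.

```python
from itertools import combinations

def forced_pair_ranking(categories, prefer):
    """Build the forced-pair chart and count the ones in each row.

    categories: category names in chart (row/column) order.
    prefer(a, b): hypothetical judgment callback; returns whichever of the
    two categories the designer considers more important.
    """
    n = len(categories)
    chart = [[None] * n for _ in range(n)]      # the diagonal is never filled
    for i, j in combinations(range(n), 2):
        winner = prefer(categories[i], categories[j])
        chart[i][j] = 1 if winner == categories[i] else 0
        chart[j][i] = 1 - chart[i][j]           # forced: exactly one side wins
    ones = {c: sum(v for v in chart[i] if v) for i, c in enumerate(categories)}
    ranking = sorted(categories, key=lambda c: -ones[c])
    return chart, ones, ranking

# Illustrative judgments only: a fixed importance order stands in for the designer.
importance = ["Reliability", "System Speed", "Power", "Weight"]

def prefer(a, b):
    return a if importance.index(a) < importance.index(b) else b

chart, ones, ranking = forced_pair_ranking(
    ["Power", "Weight", "System Speed", "Reliability"], prefer
)
```

With n categories the chart always holds n(n-1)/2 ones in total, so equal row counts (ties) can only be broken by repeating the comparison among the tied categories, as the text notes.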


The system designer must then analyze his ranking with a final “reasonableness” test. Has
the  ranking  procedure  put  various  system  requirement  categories  higher  or  lower  in  priority 
than  they  should  be?  Are  some  categories  equally  important,  etc?  The  “reasonableness”  test 
will  reveal  that  the  Forced-Pair  Comparison  method  is  somewhat  subjective,  but  the  method 
is  useful  in  getting  the  system  requirement  categories  in  perspective,  pointing  out  where  the 
system  tradeoffs  should  be  made. 

From  the  final  ranking  the  most  important  IC  parameters  may  then  be  discerned,  thereby 
allowing  the  LSI  designer  to  choose  the  proper  LSI  technology  to  perform  the  necessary 
function. 

Figure 25. Forced-Pair Comparison Chart

SECTION V


LSI  DESIGN  AND  DEVELOPMENT 


5.0  INTRODUCTION 

In Section III, the design philosophy and architecture needs of a signal processing computer
were presented and analyzed. Aside from memory, three major functional areas, Data
Processing/Data Addressing, Instruction Addressing, and Multiplier/FFT, were analyzed in
depth, and their architecture structures were chosen to meet the needs of the signal processor.

Within this chapter, the register-level design of the DP/DA will be presented. The major
architectural substructures of the RALU will be described, and the number of gates per structure
will be estimated. From these estimates, the LSI development of such a chip will be
analyzed in three technologies, CMOS/SOS, I3L, and VMOS, and a practical approach for the
development will be concluded.

The IA and multiplier/FFT functions will not be analyzed. They are beyond the scope of
this contract, although a brief discussion of the IA will be included with only terse conclusions
drawn.

5.1  REGISTER-LEVEL  DESIGN  DISCUSSION 

The two processors discussed in Section III indicated that two RALU or DP/DA designs were
necessary to utilize efficiently the different potential multiplier/FFT structures. The block
diagrams, Figures 27 and 28, are the RALU structures for Processors 1 and 2, respectively.
Within this section, the RALU substructures (Multiport RAM, Arithmetic Logic Unit, Bidirectional
I/O Data Ports, Multiplexers, Instruction Registers, and Incrementer) will be discussed.
The general design and gate estimates will be included.

5.1.1  The  Multiport  RAM  (MPR) 

The major functions necessary for an MPR are read/write addressing, input port and output
port selects, and the register file. The addressing may be best represented as a 4-line to
16-line Decoder/Demultiplexer similar to a 74LS138. In the 3-Port MPR for Processor 1, one
read and two writes are simultaneously possible; therefore, 3 addresses must be presented to
the MPR at one time. To accomplish the 3-Port addressing, 3 addressers, or 4-line to 16-line
Decoder/Demultiplexers, must be included. The Read and Write Enables are the inputs to the
demultiplexers.

For the 4-Port MPR of Processor 2, two reads, two writes, or one read and one write are
possible simultaneously. Two addresses must be available to the MPR at one time. Two
4-line to 16-line Decoder/Demultiplexers must be included for addressing. Additional gates are
necessary for each register word to distinguish the Read/Write functions from each other.

For the 3-Port MPR, 24 gates/addresser are necessary. For the 4-Port MPR, 40 gates/addresser
are necessary.

Figure 28. RALU (DP/DA) for Processor 2


The input port and output port selects are simply AND-gates or 2-line to 1-line multiplexers.
One gate per RAM bit per MPR port is necessary to accomplish the selection process. Thus,
a 3-Port MPR requires 3 gates/RAM bit, and the 4-Port MPR, 4 gates/RAM bit.

The memory elements are generally D-latches, which permit simultaneous read and write
operations on the same or different addresses. The D-latch requires 4 gates per latch; therefore,
the memory elements require 4 gates per RAM bit.

The  total  for  the  3-Port  MPR  is  72  gates  for  addressing  and  7 gates/RAM  bit  for  port  selects 
and  D-latches;  for  the  4-Port  MPR,  80  gates  for  addressing  and  8 gates/RAM  bit  for  port 
selects  and  D-latches. 
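
The MPR totals follow directly from the per-decoder and per-bit figures above. The short sketch below only restates that arithmetic; the function name is hypothetical, and the 128-bit register file (the quantity Figure 31 lists for the Processor II MPR) is assumed for both variants purely for illustration.

```python
def mpr_gates(ports, decoders, gates_per_decoder, ram_bits, latch_gates=4):
    """Gate estimate for a multiport RAM, following Section 5.1.1.

    One select gate per RAM bit per port, plus latch_gates per bit for the
    D-latch. Returns (addressing gates, gates per RAM bit, total gates).
    """
    addressing = decoders * gates_per_decoder
    per_bit = ports + latch_gates
    return addressing, per_bit, addressing + per_bit * ram_bits

# Assumed register file size: 128 RAM bits in both cases.
three_port = mpr_gates(ports=3, decoders=3, gates_per_decoder=24, ram_bits=128)
four_port = mpr_gates(ports=4, decoders=2, gates_per_decoder=40, ram_bits=128)
```

The first two values of each result reproduce the 72 and 80 addressing gates and the 7 and 8 gates/RAM bit quoted above.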

5.1.2  The  Arithmetic  Logic  Unit 

The  ALU  must  be  able  to  receive  two  operands  from  a combination  of  the  MPR  and/or  I/O 
Data  Ports  and  perform  full  word  length  operations  and  supply  the  output  wherever  directed. 
Thus,  the  ALU  should  be  a high-speed,  parallel  function  which  is  able  to  perform  arithmetics, 
logicals,  and  shifts.  Furthermore,  for  data  dependent  operations,  the  ALU  must  be  able  to 
compare  the  operands,  detect  overflows,  propagate  a carry,  and  detect  a zero  condition. 

Table  9 gives  the  ALU  operations  and  status  flags  that  should  be  available. 

The ALU, described above, is a very sophisticated function which requires approximately 12
gates per bit to perform all the operations required, including the decoding of the four operation
selection inputs and the generation of the status flags and carry out.
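
As a behavioral illustration of the operations and status flags listed in Table 9, the sketch below models a subset of them on 8-bit two's-complement words. It is not the register-level design: the function name, the operation mnemonics, and the convention that carry out on a subtraction means "no borrow" are assumptions, and the shifts and the "A only"/"B only" pass-throughs are omitted for brevity.

```python
def alu(op, a, b, width=8):
    """Behavioral model of some Table 9 operations on width-bit words.

    Returns (result, flags); flags holds carry_out, zero, overflow, equal.
    """
    mask = (1 << width) - 1
    sign = 1 << (width - 1)
    a &= mask
    b &= mask
    # Every arithmetic operation is expressed as x + y + carry_in,
    # so that a single adder serves all of them.
    adds = {
        "A+B": (a, b, 0),
        "A-B": (a, ~b & mask, 1),
        "B-A": (b, ~a & mask, 1),
        "A+1": (a, 0, 1),
        "A-1": (a, mask, 0),
    }
    logic = {
        "AND": a & b,
        "OR": a | b,
        "NOR": ~(a | b) & mask,
        "EXOR": a ^ b,
        "EXNOR": ~(a ^ b) & mask,
    }
    carry = overflow = False
    if op in adds:
        x, y, cin = adds[op]
        total = x + y + cin
        result = total & mask
        carry = total > mask
        # Signed overflow: both addends share a sign that the result lacks.
        overflow = bool(~(x ^ y) & (x ^ result) & sign)
    else:
        result = logic[op]
    flags = {
        "carry_out": carry,
        "zero": result == 0,
        "overflow": overflow,
        "equal": a == b,
    }
    return result, flags
```

For example, `alu("A+B", 0x7F, 0x01)` yields 0x80 with the overflow flag set, which is the kind of data-dependent condition the text says the ALU must detect.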

5.1.3  The  Bidirectional  I/O  Data  Port 

Three  major  functions  are  necessary  for  the  I/O  Data  Ports  - Input  Register,  Output  Register 
and  Tristate  Output.  Figure  29  shows  the  configuration  of  the  I/O  Data  Port.  The  registers 
are D-latches. The output registers hold the data to be transferred, freeing up the MPR and
ALU  for  the  next  operation,  i.e.  the  MPR  and  ALU  are  not  tied  to  the  bus.  The  output  of 
the  previous  operations  is  latched,  allowing  asynchronous  transfer.  Similarly,  the  input  register 
is  for  asynchronous  reception;  thus,  the  ALU  or  MPR,  which  are  the  destinations  of  the  input, 
need  not  be  free  when  data  is  received.  The  tristate  output  is  included  so  that  when  no  data 
is  being  transferred,  the  RALU  presents  a high-impedance  to  the  bus. 

Each register requires 4 gates per bit, and the tristate output requires the equivalent of 2 gates
per bit in area.

Table 9. ALU Operations and Status Flags

ARITHMETIC OPERATIONS        STATUS FLAGS

A + B                        CARRY OUT
A - B                        ZERO
B - A                        OVERFLOW
A + 1                        A = B
A - 1
A only
B only
Right Arithmetic Shift
Left Arithmetic Shift

LOGICAL OPERATIONS

A
B
A AND B
A OR B
A NOR B
A EXOR B
A EXNOR B


5.1.4  Miscellaneous  Functions 

The  multiplexers  within  the  RALU  are  extremely  simple,  requiring  a simple  AND  gate  for 
choosing  the  proper  input  signals  to  the  MPR  or  ALU  and  some  simple  decoding.  The  rule 
of  thumb  on  gate  count  is  approximately  one  gate  per  bit  per  input.  Thus  a three-input  mux 
requires  3 gates/bit. 

The instruction registers are D-latches and require 4 gates/bit.

The  incrementer  for  Processor  2 (see  Figure  28)  is  the  simplest  of  adders.  No  sophistication 
is  desired  for  this  function.  Seven  gates/bit  are  required. 

5.2  GATE ESTIMATES FOR THE RALUs


The  discussion  in  section  5.1  gives  the  LSI  designer  the  tools  necessary  to  estimate  the  total 
gate  counts  of  the  two  processors.  From  the  gate  counts,  the  feasibility  of  the  function  in 
LSI  may  be  discerned. 

Figure 30 is an attempt to estimate the gates necessary for the 16-bit RALUs. The RALU
functions require 2828 gates and 3304 gates for Processors 1 and 2, respectively. In the
MACRO constraints Section, a practical limit of 2000 gates was presented. The most logical
approach is an 8-bit, bit-sliced RALU; thus, the RALU function can be made of two 8-bit
RALUs.

Most  of  the  functions  on  the  RALU  are  simply  reduced  to  one-half  in  size  and  gate  count; 
however,  the  MPR  Address  Decoder  and  the  Instruction  Register  must  remain  full-size  because 
the  same  level  of  control  is  necessary,  only  the  number  of  bits  controlled  is  reduced. 

Figure  31  reflects  the  8-bit  RALUs  and  their  gate  count.  Both  RALUs  are  well  below  the 
2000  gate  limit.  The  penalty  paid  for  the  duplication  of  control  was  minimal  in  this  case; 
however,  a similar  conclusion  cannot  be  drawn  about  other  chips  unless  a full  analysis  is 
performed. 
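
The Processor II column of Figure 31 can be re-added as a check. The sketch below is an illustration, not the study's worksheet: the tuple layout, the "scales with slice width" marking, and the treatment of the MPR decode as 2 decoders of 40 gates each are assumptions made to express the table in code.

```python
# (function, quantity in bits, count on RALU, gates per bit, scales with width?)
PROCESSOR_2_RALU = [
    ("I/O DATA PORT",   8,   3, 10, True),
    ("MPR",             128, 1, 8,  True),
    ("MPR DECODE",      1,   2, 40, False),  # control: stays full size
    ("ALU",             8,   1, 12, True),
    ("SELECT MUXES",    8,   4, 3,  True),
    ("INSTRUCTION REG", 26,  1, 4,  False),  # control: stays full size
    ("ADDRESS PORT",    8,   1, 6,  True),
    ("INCREMENTER",     8,   1, 7,  True),
]

def total_gates(rows, slice_bits=8):
    """Sum quantity x count x gates/bit, widening only the data-path rows."""
    scale = slice_bits // 8
    total = 0
    for _name, bits, count, gates_per_bit, scales in rows:
        width = bits * scale if scales else bits
        total += width * count * gates_per_bit
    return total

gates_8_bit = total_gates(PROCESSOR_2_RALU)        # 8-bit slice
gates_16_bit = total_gates(PROCESSOR_2_RALU, 16)   # full 16-bit RALU
duplicated_control = 2 * gates_8_bit - gates_16_bit
```

Doubling every data-path row while holding the two control rows fixed gives the 3304 gates quoted for the 16-bit Processor 2 RALU, and the 8-bit slice totals 1744 gates. Two slices therefore cost 184 extra gates (the duplicated decode and instruction register), which is the "minimal" penalty referred to above.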

It  is  concluded  at  this  point  that,  indeed,  an  8-bit,  bit-sliced  RALU  is  the  proper  approach 
for  the  Data  Processor/Data  Addresser  from  an  LSI  point-of-view.  In  the  next  section,  three 
LSI  development  approaches  will  be  evaluated. 

5.3  LSI  DEVELOPMENT  APPROACHES 

Three LSI technologies, CMOS/SOS, I3L, and VMOS, are reasonable choices to use to
develop the 8-bit RALUs. These technologies will be analyzed in four areas: chip size,
power, fundamental speed and availability. For completeness, the analysis data for the 16-bit
RALUs will be included, primarily to further justify the bit-sliced approach.

5.3.1  Chip  Size 

The chip size of the RALUs has been estimated using the gates per mm2 data from Table 6,
Section IV. The estimate assumes that an average of 40% of the best-case gate density found
in the Technology Survey is actually attainable, because the high degree of interconnect of this
function and the high number of I/O pins will limit the gate density. Even with this assumption,
all three subject technologies are capable of exceeding 2000 gates in a 200 x 200 mil
chip. The estimate also assumes a square chip.

Table 10 reflects the results of the technologies. Each chip size is specified in a range which
is a manifestation of the high and low gate densities for the technologies. All three technologies
are capable of producing a chip under 200 x 200 mils which will perform the 8-bit RALU
function. I3L has the best density, and VMOS has the most inconsistent density. The inconsistent
density reflects conflicting information sources with different stories to tell.
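
Because the estimate assumes a square chip at a fixed (derated) gate density, the side length grows with the square root of the gate count. The sketch below illustrates that relationship; the function names are hypothetical, and the density of 0.125 gates per square mil used in the usage note is an arbitrary illustrative value, not a figure from Table 6.

```python
from math import sqrt

def chip_side_mils(gates, best_case_gates_per_sq_mil, derating=0.40):
    """Side of a square chip (mils) holding `gates` at a derated density."""
    return sqrt(gates / (best_case_gates_per_sq_mil * derating))

def rescale_side(side_mils, gates_from, gates_to):
    """Side length when only the gate count changes (density held fixed)."""
    return side_mils * sqrt(gates_to / gates_from)

# Processor 2, CMOS/SOS: scale the 8-bit chip (1744 gates) up to 16 bits (3304 gates).
low_16 = round(rescale_side(165, 1744, 3304))
high_16 = round(rescale_side(212, 1744, 3304))
```

Scaling the 165 to 212 mil range of the 8-bit Processor 2 chip in this way gives 227 to 292 mils, in agreement with the 16-bit entry of Table 10. As a separate illustration, `chip_side_mils(2000, 0.125)` gives a 200-mil side, i.e. a best-case density of 0.125 gates per square mil would be needed to fit 2000 gates on a 200 x 200 mil chip at the 40% derating.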




RALU FOR PROCESSOR I

FUNCTION           QUANTITY (BITS)  x  TOTAL ON RALU  x  GATES PER BIT  =  TOTAL GATES

I/O DATA PORT
MPR DECODE
SELECT MUXES
INSTRUCTION REG


RALU FOR PROCESSOR II

FUNCTION           QUANTITY (BITS)  x  TOTAL ON RALU  x  GATES PER BIT  =  TOTAL GATES

I/O DATA PORT             8         x        3        x       10        =      240
MPR                     128         x        1        x        8        =     1024
MPR DECODE                -         x        2        x       40        =       80
ALU                       8         x        1        x       12        =       96
SELECT MUXES              8         x        4        x        3        =       96
INSTRUCTION REG          26         x        1        x        4        =      104
ADDRESS PORT              8         x        1        x        6        =       48
INCREMENTER               8         x        1        x        7        =       56

TOTAL                                                                         1744

Figure 31. 8-Bit RALU Gate Estimates

Table 10. Technology Analysis

                               PROCESSOR 1                  PROCESSOR 2
TECHNOLOGY                 16-bit RALU   8-bit RALU     16-bit RALU   8-bit RALU

CMOS/SOS
  Chip Size*               210 - 270     153 - 197      227 - 292     165 - 212
  Power (watts)            2.8 - 7.1     [illegible]    3.3 - 8.3     1.7 - 4.4
  R-R Add** Time (ns)      33            33             33            33

I3L
  Chip Size*               191 - 210     139 - 153      206 - 227     150 - 165
  Power (watts)            2.8 - 14.1    1.5 - 7.5      3.3 - 16.5    1.7 - 8.7
  R-R Add** Time (ns)      44            44             44            44

VMOS
  Chip Size*               191 - 369     139 - 269      206 - 399     150 - 290
  Power (watts)            2.3 - 2.8     1.2 - 1.5      2.6 - 3.3     1.4 - 1.7
  R-R Add** Time (ns)      55            55             55            55

*Measured in mils on a side
**Register-to-Register Add


5.3.2  Power 

The power has been estimated from the gate dissipation data in Table 6. The estimate was
normalized to include speed as a factor in the gate dissipation; consequently, CMOS/SOS has
a gate dissipation in the low milliwatt range when it is normally reported in the microwatt
or nanowatt range. For CMOS/SOS to reach signal processing speed, the supply voltage must
be increased to 10 volts, and the power dissipation simply goes up. In Table 10, VMOS has
a lower power dissipation per gate than CMOS/SOS, but VMOS is assumed to have a lower
speed potential.

I3L also has a wide power dispersion which truly reflects the power/speed diversity of the
I2L technology, of which I3L is a variant. Unlike the MOS technologies, I3L could maintain
maximum performance and have a power below the maximum of Table 10 if the lower speed
data paths are carefully chosen and the I3L gates tailored for lower power.

The power prospects of I3L are limited; however, I3L is the worst power dissipator of the
three. If a low operational speed is assumed, I3L is under the 2 watt per chip limit discussed
in Section 4.2. VMOS will definitely meet this requirement, and CMOS/SOS is probably able
to handle this function under 2 watts if most of the gates are assumed to be "off" during
most of the time. This assumption is reasonable since only the MPR words being addressed
are "on"; thus, most of the MPR is "off" or inactive.
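
The power entries of Table 10 are consistent with chip power being the gate count multiplied by a per-gate dissipation. The sketch below illustrates this for Processor 2 (1744 and 3304 gates). The per-gate values are NOT taken from Table 6; they are back-computed from the Table 10 ranges and should be read as assumptions, as should the function and constant names.

```python
# Per-gate dissipation in milliwatts (low, high).
# Assumed values, back-computed from the Table 10 power ranges for Processor 2.
MW_PER_GATE = {
    "CMOS/SOS": (1.0, 2.5),
    "I3L": (1.0, 5.0),
    "VMOS": (0.8, 1.0),
}

WATT_LIMIT = 2.0  # the 2 watt per chip limit cited from Section 4.2

def chip_power_watts(gates, technology):
    """Return the (low, high) chip power in watts, rounded to 0.1 W."""
    low, high = MW_PER_GATE[technology]
    return round(gates * low / 1000, 1), round(gates * high / 1000, 1)

def always_within_limit(gates, technology):
    """True when even the high-end estimate stays at or under the limit."""
    return chip_power_watts(gates, technology)[1] <= WATT_LIMIT
```

Under these assumed figures only the 8-bit VMOS slice stays under 2 watts across its whole range; the CMOS/SOS and I3L slices meet the limit only toward the low end of their ranges, which matches the qualifications made in the text.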

5.3.3  Fundamental  Speed 

For this discussion, the fundamental speed will be defined as the time required to perform a
basic operation such as a register-to-register add, a compare, a register-to-register logical
operation, etc. The register-to-register add is assumed to be a representative operation for the
problem set and was analyzed in gate delays as follows:

a.  2 delays  for  Instruction  Register  Setup 

b.  1 delay to read the data out of the MPR

c.  1 delay  to  pass  thru  the  ALU  select 

d.  4 delays  in  the  ALU 

e.  1 delay to pass thru the MPR select

f.  2 delays  to  store  the  data  into  the  MPR 

A total of 11 delays is required for this operation. The fundamental speed, therefore, is the
product of the total number of gate delays and the gate delay time.
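
The delay budget above can be tallied as follows. This is a sketch only: the dictionary keys paraphrase steps a through f, and the 3, 4 and 5 ns gate delays mentioned in the usage note are not quoted in this section but are implied by dividing the Table 10 add times by 11.

```python
# Gate delays for a register-to-register add, steps a through f.
GATE_DELAYS = {
    "instruction register setup": 2,
    "read data out of the MPR": 1,
    "ALU select": 1,
    "ALU": 4,
    "MPR select": 1,
    "store data into the MPR": 2,
}

def add_time_ns(gate_delay_ns):
    """Fundamental speed: total gate delays times the delay of one gate."""
    return sum(GATE_DELAYS.values()) * gate_delay_ns
```

With minimum gate delays of 3, 4 and 5 ns, `add_time_ns` returns 33, 44 and 55 ns, the values listed in Table 10 for CMOS/SOS, I3L and VMOS respectively.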

Table  10  includes  the  Register-to-Register  Add  Time.  All  the  times  were  calculated  using 
minimum  gate  delays  and  include  no  delay  time  estimate  for  interconnect  path  length.  As 
seen  in  the  Table,  all  these  technologies  are  capable  of  high  speed  operation.  CMOS/SOS  has 
the  best  potential  at  this  time. 

5.3.4  Availability 

CMOS/SOS, I3L, and VMOS are probably best described as in the early stages of maturity,
which means that each has been demonstrated with commercial products in the marketplace;
however, wide lines of products are not available as yet.

CMOS/SOS has the longest history. The early days were rough, but CMOS/SOS is available
from RCA, HP and Rockwell, with new sources coming. Because CMOS/SOS has a good
temperature range, high noise margin, radiation-hardening potential, etc., the military market
is good; thus, the availability is sure to increase with time.

I3L and VMOS are new stars on the horizon. I3L is an extension of I2L, which is a simple
process in concept but difficult if high speed is desired. Fairchild is the only source. Other
high-speed I2L variants are becoming available: ISL and Up-Diffused I2L. Time will tell!
For now, high-speed I2L (I3L) has limited availability.


VMOS for LSI is solely sourced by AMI. It is a variation of NMOS which gives 1 um
channel length using 4-0 um layout rules. As the photolithographic process improves, VMOS
will improve in density without the heartaches that HMOS from Intel will have to go through.
VMOS is a winner! Second-sourcing will come, but availability is very limited at this time.

5.4  IA  CHIP 

This discussion about the instruction addressing will not be detailed, and is only included for
a measure of completeness. The IA includes four major subfunctions: a microsequencer, a
loop counter, an interrupt control unit and flag logic.

The microsequencer and the loop counter are orderly functions resembling the MPR/ALU of
the RALU and the incrementer of the Processor II RALU, respectively. To accomplish the
IA function in either Processor, a 12-bit wide "processor" would be necessary. Either of
these subfunctions could be bit-sliced or included upon the same chip. The LIFO Stack
would have to be limited to 8 words x 12 bits, which is a reasonable size, before the functions
could logically be placed on the same chip.

The interrupt control unit and the flag logic would generally be called "random logic," which
implies low gate to I/O pin ratios and low gate count totals. These functions have a high
degree of interaction internally (see Figure 32), which would limit the ability to slice these
functions. Although flexibility will be reduced, the most efficient LSI approach is to put all
of these subfunctions on one chip.

To minimize the number of off-chip drives between the four subfunctions, it should be
determined if the two sections can be placed on the same chip. The total gate count of all four
major subfunctions appears to be less than 1500 gates. Following the analysis on the DP/DA,
it is reasonable to assume the total IA function could be integrated onto one chip.

The final remaining concern is speed. The total number of gate delays in the microsequencer
should be roughly equivalent to the register-to-register add time of the RALU; thus, if the
DP/DA and the IA are developed in the same technology, the IA should be able to support
the high instruction rate of the DP/DA.

Figure 32. Controller Without Microsequencer


SECTION  VI 


SIGNAL  PROCESSOR  COMPARISON 


6.0  INTRODUCTION

As  a part  of  the  Multimode  Central  Processing  Unit  (MMCPU)  Design  Study  Contract,  a 
comparison  of  signal  processors  has  been  performed. 

The main thrust of the comparison is based upon three benchmark problems which are
applied to three microprocessor architectures:

Tracor/RCA

Raytheon

Litton

As presented herein, these microprocessors are specifically suited to signal processing applications.

We  shall  briefly  describe  the  system  architecture  as  a tutorial.  This  description  will  then  lead 
to  the  microprocessor  and  its  application  to  the  benchmark  problems.  Details  can  be  found 
in  the  references,  section  6.9. 

The following sections of the study include discussion of:

• Definitions (Section 6.1)

• Macro  Computer  (Section  6.2) 

• Building  Blocks  (Section  6.3) 

• Central  Processing  Unit  (Section  6.4) 

• Controller  (Section  6.5) 

• Microcomputer  (Section  6.6) 

• Benchmarks  (Section  6.7) 

• Comparison  of  Results  (Section  6.8) 

• References  (Section  6.9) 

• Benchmarks  - Coding  (Appendix  B) 

• Benchmarks - Timing (Appendix C)

6.1  DEFINITIONS 

Since  many  users  of  data  processing  systems  are  not  acquainted  with  the  techniques  and 
terms  used  in  data  processing,  we  shall  briefly  describe  some  of  these  as  they  relate  to  the 
system  architecture. 

It  is  the  hallmark  of  a scientist  that  he  define  his  terms,  for  only  then  can  semantic  confusion 
be  eliminated. 

There are terms like “storage” which are preferred over “memory.” However, the latter term
is so heavily impressed in the literature that it is hard to change it. The term “instruction
location” is more descriptive than “program counter.”


Note the difference below between Macro and Micro, and the difference between Computer
and  Processor.  Figure  33  shows  a simplified  definition  block  diagram.  It  shows  the  boundary 
lines  for  the  purpose  of  the  definitions. 

Macro Computer          Executes Macro and Microinstructions. It is the combination
                        of the host computer with the Microcomputer.

Host Computer Control   Provides Macro instructions, obtained from Program Storage,
                        and puts them into the Instruction Register. Through the
                        Mapping it controls the Microcomputer. The feedback from the
                        Microcomputer is not shown, which simplifies the diagram.

Microcomputer           Executes Microinstructions. It consists of the Microprocessor
                        and Storage. Storage includes Firmware Storage and/or
                        Operand Storage.

Microprocessor          Consists of the Controller (Sequencer), Decoder, and Register
                        Arithmetic Logic Unit (RALU). Often very limited Read
                        Only Storage and Random Access Storage for Firmware and
                        Operands respectively are provided within the Microprocessor.

ALU                     Arithmetic Logic Unit; performs additions, subtractions, and
                        logical operations.

RALU                    Register Arithmetic Logic Unit, often called Central Processing
                        Unit (CPU). It contains Multi Port Registers (MPR),
                        Multiplexers including shifter, Arithmetic Logic, and Control
                        Decode.

CPU                     Central Processing Unit; same as RALU.

MPR                     Multi Port Registers. Register stack with multiple access ports
                        (addresses) and capable of multiple operations (read and write).

Building Blocks         Blocks used to construct a Microprocessor or Microcomputer.

Controller              In this context, the sequencing of Microinstructions.

Data Processor          Processes operands.

Data Addressing         Processes operand addresses.

As  Large  Scale  Integrated  (LSI)  circuits  progress  to  accommodate  more  circuits,  more  and 
more  functions  are  included  in  a single  chip.  Thus,  the  dividing  of  functions  becomes  more 
and  more  difficult. 


Pipeline                Two meanings.

                        a.  Arrangement of multiple Arithmetic Logic Units to provide
                            execution of multiple Microinstructions concurrently.
                            Similar to array processing.

                        b.  Concurrent execution of Microinstruction with fetching
                            of Microinstruction. A better term would be “prefetch.”

Array Processor         Arrangement of multiple RALUs or CPUs to provide concurrent
                        execution of several Microinstructions, or concurrent
                        execution of several functions provided by a single
                        Microinstruction.

Figure 33. Definition Block Diagram (Simplified)

Comparison of architectures is based upon the selection of parameters. Further, a hierarchy
of the parameters and weighting of the parameters is to be established. Qualitative parameters
are to be assigned to each parameter. To create such a framework is an extremely difficult
task, especially when the architectures differ widely.

As  an  alternate  approach,  it  is  suggested  to  present  the  strong  and  weak  points  for  each 
system.  A final  comparison  is  based  upon  the  results  from  applying  the  benchmarks  to  each 
Microprocessor.  A future  comparison  may  include,  but  is  not  limited  to  such  parameters  as 
flexibility,  growth  capability,  number  of  chips,  technology,  clock  speed,  cost  of  hardware/ 
firmware/software,  program  support  capability,  and  life  cycle  cost. 

6.2  MACRO  COMPUTER 

A Macro  Computer  is  capable  of  executing  Macro  and  Microinstructions.  A Microcomputer 
executes  Microinstructions.  A subset  of  the  Microcomputer  is  the  Microprocessor. 

Signal  processing  applications  use  a Microprocessor  as  the  main  hardware.  For  completeness 
of  this  report,  a brief  description  of  the  Macro  Computer  capability  for  each  of  the  three 
manufacturers  is  presented  below.  All  three  vendors  provide  such  a capability.  Interface  to 
the  host  computer  is  included  in  the  description. 

6.2.1  Tracor/RCA Macro Computer

Tracor provides a General Processing Unit (GPU) chip which is similar to the Advanced
Micro Devices AM 2901 or the Motorola MC 2901 chip. Since Tracor provided only a description
of the GPU, which is simply a central processing unit or RALU and not a Microprocessor
nor a Macro Computer architecture, we used the AMD and Motorola descriptions of simple
Microcomputers to conjecture similar structures for the GPU.

Figure 34 is from reference C; it does not show explicitly the RAM which is connected to
the control/address/data bus. The RAM contains the program and operands. Macro Instructions
of the program are loaded into the Instruction Register and are mapped into the
Micro Program Sequencer. Microinstructions are retrieved from the Microprogram Memory.
The Pipeline Register provides for fetching the next Microinstruction while the current
Microinstruction is being executed.

6.2.2  Raytheon Macro Computer

Figure 35 is from reference D. The Sequencer (SEQ) contains both Macro Computer and
Microprocessor control. The Host Computer provides for Macro Instruction control.

Figure 34. Advanced Micro Devices Microcomputer


Figure 35. Raytheon Macro Computer


6.2.3  Litton  Macro  Computer 

Figure 36 shows a typical example of the Litton Macro Computer. A CPU chip is used to
provide for the controller function and for Data Addressing as well as the Data Processing
function. The Emulation part in Figure 36 provides for emulation of Macro Instructions. Macro
Instructions are held in the Instruction Register (IR). The Controller executes the
Microinstructions.

6.3  BUILDING  BLOCKS 

As  shown  in  the  previous  section,  a typical  Microcomputer  consists  of  the  following  building 
blocks. 


RAM       Random Access Memory
ROM       Read Only Memory, or
PROM      Programmable ROM
MPY       Multiply Chip
I/O       Input/Output
CPU       Central Processing Unit, including
ALU       Arithmetic Logic Unit
CONT      Controller

Further, the operation of the interconnected building blocks is a function of the program or
firmware. Program and firmware are produced using a programming language which is not
the subject of this report.


Figure 36. Litton Macro Computer

It is assumed the reader is familiar with most of the building blocks; therefore, a description
can be omitted. However, the important blocks of a microprocessor - CPU and Controller -
will be analyzed in further detail. Differences in the architecture of these blocks will show
an effect on the benchmark problems.

6.4  CENTRAL  PROCESSING  UNIT  (CPU) 

The name CPU is misleading, but it is used in this report because the semiconductor
manufacturers have adopted it. In this report, the CPU is a chip. The architecture of the CPU
varies from manufacturer to manufacturer. It depends upon the chip size, number of gates,
technology, and number of pins. The CPUs under consideration may be classified as bit slices.

The slices are typically 4 bits or 8 bits wide and can be cascaded to provide 16 bit
Microprocessors.
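
The cascading of slices can be illustrated with a short sketch. It assumes a simple ripple carry from each slice to the next, which is only one possible carry scheme and is not taken from the referenced data sheets; the function names are hypothetical.

```python
def slice_add(a, b, carry_in, width=4):
    """Add two width-bit operands inside one slice; return (sum, carry_out)."""
    mask = (1 << width) - 1
    total = (a & mask) + (b & mask) + carry_in
    return total & mask, total >> width


def cascaded_add(x, y, slices=4, width=4):
    """Add two words using several cascaded slices (assumed ripple carry)."""
    mask = (1 << width) - 1
    result, carry = 0, 0
    for i in range(slices):                  # least significant slice first
        a = (x >> (i * width)) & mask
        b = (y >> (i * width)) & mask
        s, carry = slice_add(a, b, carry, width)
        result |= s << (i * width)
    return result, carry
```

Four 4-bit slices or two 8-bit slices give the same 16 bit result; only the number of carry hand-offs between chips differs.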

The  references  provide  for  detailed  description.  Only  highlights  are  given  in  this  report. 

6.4.1 Tracor GPU

This  chip  is  called  General  Processing  Unit  (GPU)  and  is  shown  in  Figure  37.  It  has  the 
following  characteristics: 

8 bit  slice 

16  registers,  3 ports/2  operations 

ALC  (Arithmetic  Logic  Circuit,  limited  ALU) 

Input and Output

Concatenation logic for any word length

This chip is similar to the Advanced Micro Device AM 2901 or Motorola MC 2901
(Figure 38) chips. The 2901 is a 4 bit slice chip which contains an additional Q register.

6.4.2 Raytheon Arithmetic

Figure 39 shows a block diagram. The arithmetic is performed in three stages, a so-called
pipeline architecture. For certain applications, this arrangement has advantages. However,
the data passes through the pipe in sequence, so a time penalty is sometimes paid for
filling the pipe and for execution of single functions. The arithmetic includes a multiply
function for fast multiplications. The ALU is a double ALU, each 12 bits wide, which can
concurrently perform two operations, including operations on double length or complex
operands, in one timing unit.

6.4.3 Litton Multimode Central Processing Unit (MMCPU)

Figure  40  shows  the  block  diagram  which  has  the  following  features: 

8 bit  slice 

16  bit  register,  4 ports,  3 operations 
ALU 

3 Bidirectional  I/O  Ports 

This chip can be used for two functions: Data Processing (DP) and Data Addressing (DA).
Figure 38. Advanced Micro Device Microprocessor Slice

6.5 CONTROLLER OR SEQUENCER


To  build  a Microcomputer,  one  needs  the  following  building  blocks: 

CPU  or  RALU 
Controller
ROM 
RAM 

The controller accesses ROM to fetch a Microinstruction. It decodes the Microinstruction to
steer the CPU/RALU and other functions. In the same time interval, it updates the ROM
address for accessing the next Microinstruction.

Advanced architectures use a register to hold the Microinstruction while the next
Microinstruction is being accessed. Thus, instruction execution and next instruction access
occur at the same time. This is often called "pipeline operation." A more appropriate name
would be "instruction prefetch," which avoids confusion with "pipeline operation" referring to
sequential operation through serially connected RALUs or RALUs in an array.
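
The effect of instruction prefetch on run time can be sketched as follows. The sketch assumes that a fetch and an execution each occupy one time interval; this is a simplification made for illustration, and the function name is hypothetical.

```python
def total_intervals(n_microinstructions, prefetch):
    """Count time intervals, assuming fetch and execute take one interval each."""
    if n_microinstructions == 0:
        return 0
    if prefetch:
        # Only the first fetch is exposed; every later fetch overlaps
        # the execution of the preceding microinstruction.
        return n_microinstructions + 1
    # Without the holding register, fetch and execute alternate.
    return 2 * n_microinstructions
```

Under these assumptions, 100 microinstructions need 200 intervals without prefetch and 101 with it, so the gain approaches a factor of two for long sequences.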

6.5.1 Tracor/MC2909 Controller

In the absence of a Tracor Controller, the MC 2909 has been substituted. Figure 41 shows the
MC 2909 Microprogram Sequencer block diagram. It is a 4 bit slice and is cascadable. A
4x4 file with stack pointer and push/pop control provides for nesting subroutines. Direct
inputs provide for N-way branches.

6.5.2  Raytheon  Controller 

The control functions in the Raytheon Microcomputer/Microprocessor are distributed.

Figure 42 shows a block diagram. The dotted line encloses the controller function with the
following blocks: SEQ, PIPE A, PIPE B. A FIFO (first in first out) is used to shift
control from block to block in this pipeline architecture. Details of the shift logic are shown in
Figure 43. The PROM is used to decode the control code obtained from the shift registers.

Note:  For  consistency  with  our  definitions,  the  Raytheon  MACRO  (see 
Reference  D)  is  functionally  equivalent  to  a Microinstruction. 

6.5.3  Litton  Controller 

Figure 44 shows the Controller block diagram. It is a 12 bit slice. A LIFO stack, 8 deep,
provides for subroutine nesting. Interrupt and branch logic is included.

6.6  MICROCOMPUTER 

A typical  basic  Microcomputer  consists  of  a Microprocessor  and  storage,  i.e.,  a Controller,  a 
ROM,  a CPU,  and  a RAM.  Microprocessors  for  signal  processing  applications  must  provide 
for  high  throughput.  This  is  accomplished  in  several  ways: 

technology 
architecture 
Figure 42. Raytheon Controller

Figure 43. Raytheon Controller Detail


The speed of a processor depends upon the selected circuit technology. For comparison
purposes, it is assumed that all processors use equivalent circuit speed. Therefore, any
speed advantage is provided by the architecture.

First,  it  is  assumed  that  all  processors  provide  for  concurrent  operation  of  controller  and 
CPU.  This  means  that  the  execution  of  a Microinstruction  occurs  in  the  same  time  interval 
as  the  fetching  of  the  next  microinstruction  in  the  sequence. 

Second,  it  is  assumed  that  two  CPU  type  chips  will  provide  for  Data  Processing  (DP)  as  well 
as  Data  Addressing  (DA)  function.  This  architecture  provides  for  concurrent  data  address 
updating as well as data processing. Note that in each time interval, three functions are
performed: DA, DP, and Control.

Third,  it  is  assumed  a special  function  chip  provides  for  fast  multiply. 

Typically,  all  Microcomputers  and  Microprocessors  have  a very  wide  Microinstruction  format 
which  reduces  the  decoding  logic  and  therefore  gives  a speed  advantage.  The  number  of 
functions performed by a CPU dictates the number of bits to be accommodated in the format.

6.6.1 Tracor/MC2901 Microcomputer

A single Microcomputer architecture around the MC 2901 and MC 2909 is shown in
Figure 45. This type of architecture is applicable to the Tracor CPU.


Figure 45. Microprogrammed Architecture Around MC2901's


An advanced Microcomputer architecture is shown in Figure 46. The GPU chip is applied
to two functions, DP and DA, to obtain further concurrent operation.

The  Data  Addressing  (DA)  and  Data  Processing  (DP)  functions  operate  concurrently.  The  DA 
provides  for  address  and  index  computation  while  the  DP  performs  the  operation  on  the 
operands.  A multiply  chip  is  a special  function  chip.  Appropriate  multiplexers  (MUX)  route 
the  data  according  to  the  control  obtained  from  the  Microinstruction.  This  architecture  will 
be  applied  to  the  benchmark  problems  for  comparison  purposes. 

6.6.2  Raytheon  Microcomputer 

Figure  47  shows  the  block  diagram  of  the  Raytheon  Macro  Computer/Microcomputer.  Only 
the  Microcomputer  function  will  be  used  in  comparing  the  benchmarks. 

The  architecture  shows  an  Address  generation  (ADGN)  which  operates  concurrently  with  the 
arithmetic.  Operand  preparation  is  performed  in  Pipe  A.  The  arithmetic  function  is  per- 
formed in  Pipe  B.  Pipe  A and  Pipe  B are  in  series.  Advantages  and  disadvantages  of  such  an 
architecture  will  be  reflected  in  the  evaluation  of  the  benchmarks. 

6.6.3  Litton  Microcomputer 

Figure  48  shows  the  Litton  simplified  Microcomputer  using  the  MMCPU.  The  controller  and 
firmware  memory  provide  for  Microinstruction  sequencing.  The  Data  Address  (DA)  calculates 
the operand addresses. The Data Processing (DP) function provides for operation on operands.
Controller, DA, and DP operate concurrently. This architecture is compatible with the Tracor
and Raytheon architectures. A more advanced Litton Microcomputer architecture provides
for additional concurrent operation and is excluded from the current comparison (see
Section III).

Note: The multiplexers (MUXs) in Figure 48 do not actually exist. They
have been included in the diagram for ease of understanding; however,
the tri-state outputs of the MMCPU do not require additional MUXs.

6.7 BENCHMARK COMPARISON

The three Microprocessor architectures

Tracor (Figure 46)

Raytheon (Figure 47)

Litton (Figure 48)

which have been described in Section 6.6 will be compared based upon selected benchmarks.
The following benchmarks have been described in Section II.

Fast  Fourier  Transform  (FFT) 

Weighted  FFT 
Cosine  Transformation 
Coordinate  Conversion 
Constant False Alarm Rate (CFAR)

Sorting  of  Pulse  Repetition  Frequencies  (PRF) 

The  first  three  benchmarks  are  related  to  each  other  and  are  subsets.  Therefore,  only  the 
FFT  will  be  coded.  The  Weighted  FFT  and  Cosine  Transformation  benchmarks  would  give 
results  which  are  similar  to  the  FFT. 

The  PRF  benchmark  includes  the  following  two  problems: 

Pulse  Classification,  and 
PRF  Sorting 

The coding and comparison of this benchmark have not been included in this comparison
because the Raytheon Micro Signal Processor would be unfairly viewed.

Detailed coding sheets for the benchmark problems are listed in Appendix B. The timing
calculations are given in Appendix C. A comparison of the results will be given in
Section 6.8.

6.7.1  Algorithms 

The  benchmarks  are  described  in  Section  II.  Also,  derivations  of  formulas  for  approximations 
of  trigonometric  functions  can  be  found  in  the  same  section.  This  section  will  only  list  the 
equations which have been directly coded.

Coding of isolated benchmarks can be misleading. The designer should always keep the total
system in mind. Therefore, the benchmark may not give the total story. The programming
strategy and style greatly depend upon the application. For example, buffer space may
depend upon the coding of a single data point or the repetition of data points. In other
words, a formula may be applied to one data point followed by the next data point.
Alternatively, the first step of the formula may be applied to all data points followed by the next
step in the formula. This will affect the indexing through the data base and affect the
requirements for temporary storage as well as the throughput. Careful analysis influences the
selection of an approach.
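
The two coding strategies can be contrasted in a short sketch. The two-step formula (`step1`, `step2`) is hypothetical and stands in for any benchmark equation; the second return value is the peak temporary storage in words under that assumption.

```python
def step1(x):
    """Hypothetical first step of a formula."""
    return 3 * x + 1


def step2(t):
    """Hypothetical second step of the same formula."""
    return t * t


def point_by_point(data):
    """Apply the whole formula to one data point, then move to the next."""
    out = []
    for x in data:
        t = step1(x)            # a single temporary word is reused
        out.append(step2(t))
    return out, 1


def step_by_step(data):
    """Apply step 1 to all data points, then step 2 to all data points."""
    temp = [step1(x) for x in data]      # one temporary word per data point
    out = [step2(t) for t in temp]
    return out, len(temp)
```

Both orderings produce the same results; they differ in how the data base is indexed and in the temporary storage required.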

Small  systems  versus  large  systems  can  influence  the  coding.  In  small  systems  one  can  code 
routines  in-line,  i.e.,  with  a minimal  number  of  subroutines,  jumps,  calls,  etc.  This  coding 
technique  provides  for  high  throughput  at  the  expense  of  larger  program  (firmware)  storage. 

In  larger  systems,  one  would  code  routines  as  subroutines  or  call  routines.  This  approach 
provides  for  efficient  initialization  of  routines  at  the  price  of  an  overhead  in  calling  the  sub- 
routines. Further,  the  subroutines  may  use  registers  which  have  to  be  saved  at  the  entrance 
to  the  subroutine  and  must  be  restored  before  leaving  the  subroutine.  The  passing  of  param- 
eters to  the  subroutines  is  provided  by  preassigned  registers. 

In  total  system  programming,  the  initialization  of  subprograms,  such  as  benchmarks,  must  be 
considered.  This  overhead  is  normally  not  included  in  benchmark  problems  and  depends 
greatly  upon  the  architecture. 

6.7.2 Instructions and Timing

Conventionally  benchmarks  are  compared  upon  the  following  parameters: 

Number of instructions (storage)

Operand and temporary storage (storage)

Number of instructions executed (throughput)

Time to execute benchmark (throughput)

The Tracor and Litton Microprocessor architectures are similar. Therefore, the analyses of
their Microinstructions are grouped together. The Raytheon is of a different type of architecture.
Therefore, the total number of “MACRO” instructions will not be tabulated nor compared.

The operand storage can be excluded from the comparison when assuming that the data bases
for all Microprocessors are very similar. One should note that the total storage capacity in
bits for the Raytheon architecture is smaller due to its 12 bit word length as compared with
the 16 bit word length for the Tracor and Litton storage. Further, storage inefficiencies in
the Raytheon architecture may occur due to its addressing structure: double words
(12 plus 12 bits) are always accessed by a single address.

The number of Microinstructions executed is significantly different from the number of
Microinstructions in the program. This effect is due to the execution of loops in signal
processing applications. The number of Microinstructions executed is proportional to the total
throughput time. Thus, the time to execute a benchmark is a yardstick for comparison.

The Microinstruction execution time depends upon several parameters such as the logic speed,
which depends upon the selected technology. The speed of storage (memory) is reflected in
the execution time. Concurrent operation of Data Addressing and Data Processing functions
allows a reduction in the total run time of certain algorithms. In algorithms where the
number of operand memory accesses is high, the reduction may be a factor of 2 or more.
Pipeline or array processing may give additional speed advantage at the cost of additional
hardware.

The  clock  speed  in  a Microprocessor  depends  upon  the  technology  and  the  logic  path  through 
the  logic.  All  these  parameters  make  a comparison  extremely  difficult.  Therefore,  this  study 
attempts  to  normalize  the  parameters  for  comparison  purposes.  This  means  that  all  archi- 
tectures assume  the  same  technology  which  includes  storage  speed  as  well  as  logic  speed.  The 
normalization factor is called “cycle.” All comparisons are based upon the total number of
cycles. The calculation of cycles is given below.

6.7.3 Tracor and Litton Microinstructions

The Microinstructions for the Tracor and Litton Microprocessors are very similar. The
Microinstructions have been symbolized for the purpose of comparison in this report. Figure 49
shows the Microinstruction types. The types used for the benchmarks are classified into the
following classes:

add,  subtract,  logic 

multiply 

square 

shifts 


M        =  MEMORY
M(X)     =  MEMORY, X = ADDRESS
M(IN)    =  MEMORY, IN = INDEX N ADDRESS
M(RN)    =  MEMORY, RN = REGISTER N ADDRESS
RN       =  REGISTER N, N = 0 TO E
RM       =  MULTIPLIER OUTPUT
MPY A    =  MULTIPLIER INPUT A
MPY B    =  MULTIPLIER INPUT B

Figure 49. Microinstruction Symbols (Tracor/Litton)

Jump operations are assumed to be an option of each Microinstruction and operate in
parallel with the above mentioned classes.


The difference between the Tracor and Litton list in Figure 50 is due to the architecture.
The Litton multiport register file provides for 3 addresses as compared with the 2 addresses
in the Tracor Microprocessor. Therefore, several Microinstruction or data manipulation
options are provided in the Litton architecture which cannot be duplicated by the
Tracor architecture.

Two of the multiply operations have to be executed in two steps in the Tracor
Microprocessor. This is due to the limited port logic in the General Processor Unit.
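
The consequence of the address count can be sketched with a toy register file. The sketch assumes that the two-address file can only write the result back to one of the two addresses it reads; the function names and the dictionary model are hypothetical and do not reproduce the actual microinstruction sets.

```python
def three_address_add(regs, dst, src_a, src_b):
    """Read two addresses and write a third in a single microinstruction."""
    regs[dst] = regs[src_a] + regs[src_b]
    return 1                                   # microinstructions used


def two_address_add(regs, dst, src_a, src_b):
    """The result must go to one of the two addressed registers."""
    if dst in (src_a, src_b):
        regs[dst] = regs[src_a] + regs[src_b]
        return 1
    regs[dst] = regs[src_a]                    # step 1: move src_a into dst
    regs[dst] = regs[dst] + regs[src_b]        # step 2: add src_b into dst
    return 2
```

When both source operands must be preserved, the two-address form needs an extra move, which is the kind of two-step execution described above.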

Figure 51 shows the Microinstruction timing in cycles. It is assumed that a Register to
Register (R/R) operation is executed in one cycle, which occurs concurrently with the access
of the next Microinstruction. A jump instruction takes an additional cycle only when the
jump has been executed. For example, the jump may depend upon the result of an
operation; if the condition is false, then no jump takes place and no additional cycle is needed.

Storage  is  normally  slower  than  the  logic  speed.  For  a simple  approach,  it  is  assumed  that 
the  speed  factor  is  two.  It  is  also  assumed  that  a special  multiplier  chip  requires  two  cycles 
to  perform  its  operation. 

Note:  Generally,  the  multiplier  speed  is  more  than  a factor  of  two  slower 
than  the  CPU  speed. 
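
The cycle accounting described above can be sketched as a small counting routine. The cost table is an assumption derived from the stated rules (one cycle for a register to register operation, a factor of two for storage, two cycles for a multiply, one extra cycle for a taken jump); it does not reproduce Figure 51, and the names are hypothetical.

```python
# Assumed cycle cost per class of executed microinstruction.
CYCLES = {"rr": 1, "mem": 2, "mpy": 2}


def count_cycles(trace):
    """trace: iterable of (kind, jump_taken) pairs, one per executed microinstruction."""
    total = 0
    for kind, jump_taken in trace:
        total += CYCLES[kind]
        if jump_taken:
            total += 1          # a jump costs an extra cycle only when it is taken
    return total


def loop_trace(iterations):
    """Hypothetical loop body: memory access, multiply, R/R, R/R with loop jump."""
    trace = []
    for i in range(iterations):
        last = i == iterations - 1
        trace += [("mem", False), ("mpy", False), ("rr", False), ("rr", not last)]
    return trace
```

For ten iterations of this hypothetical loop the count is 10 x 6 cycles plus 9 taken jumps, i.e. 69 cycles, although the loop occupies only four microinstructions of storage.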

6.7.4 Raytheon Timing

The available documents show two versions of the pipeline architecture. One shows a Pipe A
and a Pipe B; the other shows three stages in the pipeline. The overall timing of the
pipeline is dependent upon the operand storage. For each clock period, one data word enters
the pipe and one data word leaves the pipe.

During four clock periods, the memory reads from two addresses, a double word each, and
writes two double words into two addresses. Buffers coordinate the data flow as follows.

AM are the most significant 12 data bits of address “A.” AL are the least significant 12 data
bits of address A.


Clock 1    read AM, AL
           AM to Pipe In
           AL to buffer
           CM from Pipe Out to buffer

Clock 2    read BM, BL
           BM to Pipe In
           BL to buffer
           DM from Pipe Out to buffer

Clock 3    AL to Pipe In
           CL from Pipe Out
           CM from buffer, write CM, CL

Clock 4    BL to Pipe In
           DL from Pipe Out
           DM from buffer, write DM, DL

Note: Addresses C and D are delayed by the sequencer to coincide with the
data which has been delayed through the pipe.

Figure 50. Microinstruction Types (Tracor/Litton)

Figure 51. Microinstruction Timing (Tracor/Litton)

Figure 52 shows a simplified block diagram. The routing provides the data (operands) to
and from the pipeline as described above. The pipeline consists of three stages: Scaling,
Multiply, and Accumulate. Each stage operates on the operands during four clocks on four
operands or the combination of the four operands. The Multiply can perform four
multiplies in four clock periods. The Accumulate contains a double adder with feedback which
gives 4 x 2 additions during each four clock periods.

It is assumed that there is no time delay in the data input to the Scaling and that the
operands are programmed to arrive in the right sequence. The intermediate results
within the pipe are buffered and forwarded appropriately. The outputs from the Accumulate
are buffered to provide the right sequencing for writing the data back into storage.

The  flow  through  the  pipeline  shows  four  MACRO  times.  Each  MACRO  is  shifted  through 
the  pipe  every  four  clock  times.  Since  the  clock  is  related  to  the  data  access,  one  assumes 
for  normalization  purpose  that  one  clock  equals  two  cycles.  Furthermore,  this  normalization 
is consistent with the multiplier time assumed for the Tracor/Litton architectures, i.e., two
cycles  per  multiply. 

A MACRO can be repeated for several operands, which provides a continuous data flow. The
next MACRO in sequence may start immediately after the current MACRO has obtained the
data. However, the data for the second MACRO (E/F in Figure 52) must be independent
of the data of the previous data output. Figure 52 shows a four MACRO delay (flush of
pipe) when the data output (C/D in Figure 52) is to be used as a next data input.

Figure 52. Raytheon Timing

Data  set-up  and  address  generation  are  assumed  to  operate  appropriately. 
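
The normalization of the Raytheon timing can be sketched numerically. The sketch assumes that each MACRO occupies four clocks, that one clock equals two cycles, and that every flush of the pipe adds a delay of four MACRO times; the treatment of the flush as a simple additive delay and the function name are assumptions made for illustration.

```python
CLOCKS_PER_MACRO = 4      # a MACRO is shifted through the pipe every four clocks
CYCLES_PER_CLOCK = 2      # normalization: one clock equals two cycles
FLUSH_MACRO_TIMES = 4     # assumed delay when an output feeds the next input


def raytheon_cycles(n_macros, n_flushes=0):
    """Normalized cycle count for a run of MACROs with dependent-data flushes."""
    macro_cycles = CLOCKS_PER_MACRO * CYCLES_PER_CLOCK
    return (n_macros + FLUSH_MACRO_TIMES * n_flushes) * macro_cycles
```

Under these assumptions ten independent MACROs take 80 cycles, and a single data dependency that forces a flush raises the count to 112 cycles.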

6.7.5 Fast Fourier Transformation (FFT)

Coding  of  one  butterfly  for  the  FFT  will  be  shown.  The  algorithms  are  shown  on  the  coding 
sheets. 

Appendix B1 shows the Coding for the Tracor architecture. The Litton architecture uses the
same coding.

Appendix B2 shows the Raytheon Coding.

Timing calculations are shown in Appendix C1 for the Tracor/Litton architecture and in C2
for Raytheon. A 1024 point FFT was assumed, which requires ten passes. The Tracor/Litton
architecture uses a single accumulator and a single multiplier, which is equivalent to a “real
in place FFT.” The Raytheon architecture performs a complex in place FFT. Adding
hardware to the Tracor/Litton architecture to provide “complex” calculation will reduce the time
by a factor of 2 to 4. (See Section III.)
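
The pass and butterfly counts behind the 1024 point assumption can be sketched as follows. A radix-2 transform is assumed, which is consistent with ten passes for 1024 points; the butterfly shown is the textbook form and is not copied from the coding sheets in Appendix B.

```python
def butterfly(a, b, w):
    """One radix-2 butterfly on complex values a, b with twiddle factor w."""
    t = w * b                  # complex multiply: 4 real multiplies, 2 real adds
    return a + t, a - t        # two complex adds: 4 real adds


def fft_counts(n):
    """Return (passes, total butterflies) for an n point radix-2 FFT."""
    passes = n.bit_length() - 1        # log2(n) when n is a power of two
    return passes, (n // 2) * passes
```

For n = 1024 this gives 10 passes and 5120 butterflies, so the time for one butterfly dominates the benchmark result.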

Note that the results are tabulated in Figure 53.

Figure 53. Microprocessor Cycles for Benchmarks

6.7.6 Coordinate Conversion

There are two parts to the coordinate conversion:

a.  Polar to Rectangular
b.  Rectangular to Polar

Appendix B3 shows the Polar to Rectangular Coding for the Tracor architecture. The coding
takes into account in which quadrant the angle is located. The trigonometric functions, sine
and cosine, use an approximation for an angle of less than π/4. The signs of sine and cosine
are selected according to the quadrant. The Litton coding differences are indicated and show
a reduction of coding due to the multiply function.
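
The quadrant handling described above can be sketched as follows. The approximation used here is a truncated Taylor series chosen for illustration; it is not the approximation derived in Section II, and the function names are hypothetical.

```python
import math


def _sin_cos_small(a):
    """Assumed polynomial approximation, intended for 0 <= a <= pi/4."""
    a2 = a * a
    s = a * (1 - a2 / 6 * (1 - a2 / 20 * (1 - a2 / 42)))
    c = 1 - a2 / 2 * (1 - a2 / 12 * (1 - a2 / 30))
    return s, c


def polar_to_rect(r, theta):
    """Convert (r, theta) to (x, y) using quadrant reduction."""
    theta = theta % (2 * math.pi)
    quadrant = min(int(theta // (math.pi / 2)), 3)
    a = theta - quadrant * (math.pi / 2)          # angle within the quadrant
    if a > math.pi / 4:
        # Use the complementary angle so the approximation stays below pi/4.
        c, s = _sin_cos_small(math.pi / 2 - a)
    else:
        s, c = _sin_cos_small(a)
    # Select sign and role of sine and cosine according to the quadrant.
    if quadrant == 0:
        sin_t, cos_t = s, c
    elif quadrant == 1:
        sin_t, cos_t = c, -s
    elif quadrant == 2:
        sin_t, cos_t = -s, -c
    else:
        sin_t, cos_t = -c, s
    return r * cos_t, r * sin_t
```

The data-dependent branches in this sketch correspond to the quadrant tests that the Tracor and Litton coding must execute for each point.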

Appendix B4 shows the Polar to Rectangular Coding for the Raytheon architecture. The
simplicity of the coding is due to the built-in hardware function which directly provides the
trigonometric function. Due to the hardware duality in the pipeline, two data points can be
converted at the same time in a single macro. It should be noted that the coding assumes
that the pipeline receives first quadrant data. If the angle must be tested, a severe penalty
must be paid to perform one or more data-dependent branches.

Appendix C3 shows the Polar to Rectangular Timing for the Tracor and Litton architecture.
The timing is for a single point. The angle is assumed to be in the second quadrant. The
path through the coding is shown. The Litton architecture uses fewer instructions, as
indicated by the # symbol.

Appendix C4 shows the Polar to Rectangular Timing for the Raytheon architecture. Again,
a single point conversion is assumed. Since two conversions per macro are performed, this is
reflected in the timing calculation.

Appendix B5 shows the Coding for Rectangular to Polar conversion for the Tracor and Litton
architecture. The square root (SQ) computation uses an estimation algorithm. Note that the
coding tests for negative numbers, which makes the coding applicable as a general routine.
Negative numbers are converted to positive numbers.

The calculation of the angle uses an approximation of the arc sine (sin⁻¹) function. This
function is similar to the sine function and differs only in the constants (see Appendix B). A test
of the coordinates is performed to determine whether the angle will be smaller or larger than
π/4 in a quadrant. After calculation of the angle, the appropriate quadrant is
determined.

The algorithm requires divisions. Since the multiply chip does not contain the divide
function, a separate approximation is used which computes more than two bits of the
quotient per iteration. Since the divide is a general subroutine, several tests are made
to determine whether the dividend is zero or the divisor is zero. Dividend, Divisor and
Quotient are passed by Registers 1, [illegible] and 0 respectively. Other registers being used are
appropriately saved and restored.
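
A divide subroutine built only from multiplications and additions can be sketched as follows. The sketch uses a Newton-Raphson reciprocal iteration as a stand-in; the report does not identify its divide approximation, so the method, the starting estimate, and the function name are all assumptions.

```python
import math


def approx_divide(dividend, divisor, iterations=4):
    """Divide using multiply and add only (assumed Newton-Raphson reciprocal)."""
    if divisor == 0:
        raise ZeroDivisionError("divisor is zero")
    if dividend == 0:
        return 0.0                                   # early exit, no iteration needed
    sign = -1.0 if (dividend < 0) != (divisor < 0) else 1.0
    n, d = abs(dividend), abs(divisor)
    m, e = math.frexp(d)                             # d = m * 2**e, 0.5 <= m < 1
    x = 48.0 / 17.0 - (32.0 / 17.0) * m              # linear first estimate of 1/m
    for _ in range(iterations):
        x = x * (2.0 - m * x)                        # each pass roughly doubles the correct bits
    return sign * math.ldexp(n * x, -e)              # undo the normalization of the divisor
```

The tests for a zero dividend and a zero divisor mirror the checks mentioned above for a general subroutine.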

Appendix B6 shows the Rectangular to Polar Coding for the Raytheon architecture. The
angle is determined directly by the built-in trigonometric function, assuming it will give the
appropriate quadrant. The R is determined by a multiplication and addition rather than by
a square root. It takes advantage of the built-in hardware which provides trigonometric
functions. Unlike the polar-to-rectangular conversion, the angle estimation hardware does
not need quadrant information for this conversion. This function provides a real and
significant performance improvement.
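
One way to obtain R by multiplication and addition once the angle is known is the identity R = X cos θ + Y sin θ. The report does not state which formula the Raytheon coding uses, so this identity is an assumption; `math.atan2` stands in for the built-in angle hardware.

```python
import math


def rect_to_polar(x, y):
    """Return (r, theta) without taking a square root."""
    theta = math.atan2(y, x)                          # stand-in for the angle hardware
    r = x * math.cos(theta) + y * math.sin(theta)     # two multiplies and one add
    return r, theta
```

Because the angle already carries the quadrant, no separate sign tests are needed in this form.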

Appendix  C5  shows  the  Rectangular  to  Polar  Timing  for  the  Tracor  and  Litton  architecture. 
Appendix  C6  shows  the  Rectangular  to  Polar  Timing  for  the  Raytheon  architecture. 

6.7.7  Constant  False  Alarm  Rate  (CFAR) 

The Constant False Alarm Rate (CFAR) is a sliding window CFAR benchmark. It assumes
a 256 cell window. The range contains 4096 cells. Each cell assumes a six bit positive
value from the A-to-D conversion stage.

Appendix B7 shows the CFAR coding for the Tracor/Litton architecture. The sliding window
is defined by the Indices I1 and I2. The midpoint is I4. The threshold decision is one bit.
Each 16 consecutive decisions are packed into one word (Index I5). The resulting word is
stored by Index I3. Operation is in real time. The coding takes full advantage of the
indexing capabilities of the Data Addressing function.
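
The sliding window and the packing of decisions can be sketched as follows. The threshold rule (midpoint cell compared with a scaled window average) and the `scale` parameter are assumptions made for illustration; the actual rule is defined in Section II and Appendix B7, and the function name is hypothetical.

```python
def cfar_sliding_window(cells, window=256, scale=4, word_bits=16):
    """Return (decisions, packed_words) for a sliding-window CFAR pass."""
    half = window // 2
    total = sum(cells[:window])                 # running sum over the window
    decisions = []
    for lead in range(window, len(cells) + 1):
        trail = lead - window                   # first cell of the window
        mid = trail + half                      # midpoint cell under test
        # Assumed rule: midpoint exceeds `scale` times the window average.
        decisions.append(1 if cells[mid] * window > scale * total else 0)
        if lead < len(cells):
            total += cells[lead] - cells[trail] # slide the window by one cell
    words = []
    for i in range(0, len(decisions), word_bits):
        word = 0
        for bit in decisions[i:i + word_bits]:
            word = (word << 1) | bit            # pack decisions, first decision first
        words.append(word)
    return decisions, words
```

Updating the running sum with one entering and one leaving cell per step is what allows a single pass over the range.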

Appendix  B8  shows  the  CFAR  coding  for  the  Raytheon  architecture.  The  first  macro  accumu- 
lates the  window,  four  bits  at  a time.  The  second  MACRO  updates  the  window  for  the  next 
two  cells;  this  is  a pass  over  the  whole  range.  After  flushing  the  pipe,  the  next  macro  per- 
forms two  decisions.  Note,  the  decisions  are  not  packed.  The  sign  of  each  word  represents 
the  decision. 

The operation is not in real time since the range has to be processed twice, requiring large
intermediate operand storage. If there were a feedback from the accumulator into the
multiplier, then the accumulator would be available internally to the pipe. In other words,
the storing of the accumulation “S” would be eliminated. Each decision would be made from
the internal accumulation with an accessed midpoint. The accumulator would be updated from
2 range cells. This scheme requires clever arrangement of the range cells. Program execution
would alternate between a pair of macros after 128 decisions each.

Appendix  C7  shows  the  CFAR  Timing  for  the  Tracor/Litton  architecture.  Note,  the 
processing  is  in  real  time  and  each  16  decisions  are  packed  into  a word. 

Appendix C8 shows the CFAR Timing for the Raytheon architecture. Processing is not in
real  time  and  decisions  are  each  in  a separate  word. 

Appendix  C9  shows  the  CFAR  Timing  for  the  Tracor/Litton  architecture.  This  coding  and 
timing  assumes  that  each  decision  is  stored  in  a separate  word.  Note,  operation  is  still 
in  real  time,  but  coding  is  shorter  and  therefore  the  timing  has  been  reduced  significantly 
as  compared  with  Appendix  C7. 

6.7.8 Microprocessor Cycles for Benchmarks

Figure 53 tabulates the microprocessor cycles for each benchmark and for each microprocessor
architecture. A discussion of the results follows in the next section.

6.8 COMPARISON OF RESULTS


The comparisons of microprocessor architectures in this section are based upon the results
presented in Section 6.7. The results are based upon the evaluation of three microprocessor
architectures and their application to signal processing benchmarks. The evaluations
are based upon available documentation. Some of the documentation may be voluminous
but lacks the details necessary to allow sufficient analysis. Therefore, in many instances,
assumptions have been made, assuming that the good guesses are right. The lack of details and
consistency in the available documents may be due to the state in which the presented
architectures were at the time of their publishing. This is reflected in the architectures
appearing to be conceptual rather than designed or implemented.

An effort has been made to evaluate the architectures in the best light and to be
optimistic rather than pessimistic. Parameters have been normalized to provide a fair comparison.
Despite adverse circumstances, significant discoveries have been made. Further studies and
subsequent comparisons may make use of these facts, and a refinement of the analysis,
evaluation and comparison of results may be achievable.

6.8.1  Architecture  Comparison 

The comparison of the Tracor, Raytheon and Litton microprocessor architectures showed
the  following: 

a.  The  Tracor  and  Litton  architectures  are  very  similar;  the  Raytheon 
architecture  is  different.  Tracor  and  Litton  architectures  are  readily  enhanced 
to reflect an array processor architecture, similar to Raytheon's architecture.

b.  All three architectures have a controller/sequencer which fetches
microinstructions and operates concurrently with microinstruction execution.

c.  All  three  architectures  have  data  addressing/address  generator  hardware  which 
operates concurrently with the data processing/pipeline.

d.  All three architectures have multiply hardware as a special function. The
Raytheon architecture reflects the State-of-the-Art “optimum” in this
respect, i.e., a multiplier followed by an accumulator. That function reduces
the data traffic on the data bus, which is extremely important in signal
processing. (See multiplier discussion in Section III.)

e.  The Tracor/Litton architectures have a single Data Processor and operate on a
single operand (data point) at a time. Raytheon has a parallel pipeline which
can operate on two operands (data points) at a time due to its dual data
path and dual arithmetic. The additional hardware therefore provides an apparently
higher throughput as compared with the Tracor/Litton architecture.

f.  It is anticipated that a Tracor/Litton array processor architecture provides
a speed advantage of 2 to 4 over a single Data Processing architecture.

g.  The Litton CPU chip, as compared with the Tracor CPU chip, has more ports on the
chip, and these ports are bidirectional. Thus, fewer microinstructions are required to
route data, and more powerful microinstructions are provided in the repertoire. This
is reflected in the benchmark coding. Furthermore, operation as a data addressing
chip is enhanced because additional processing of literals is obtained.

h.  The  Litton  CPU  has  Multi  Port  Registers  (MPR)  with  three  addresses  and  performs 
three  operations  on  all  three  addresses.  Litton  can  read  from  two  addresses  and 
write  into  a third  address.  Tracor  has  a two  address,  three  operation  MPR.  Tracor 
can  read  from  two  addresses;  one  of  those  addresses  can  be  used  to  write  data  back 
into  the  MPR.  The  three  address  feature  in  the  Litton  CPU  did  not  provide  an 
advantage  in  the  given  benchmark  coding;  however,  the  advantage  is  most  significant 
in  Data  Addressing. 

i.  The special divide algorithm does not show a significant improvement over conventional
algorithms because of the large overhead in determining special cases of dividend and
divisor.

6.8.2  Timing  Comparison 

Figure 54 is a summary of the benchmark comparison expressed in cycles (normalized). This
figure also shows estimates for array architectures. The speed improvement of an array
architecture over a single Data Processing architecture is assumed to be a factor of 2 to 4.

RECT TO POLAR               323        16       312       106        53

CFAR
  BIT PACKED             63,499         -    63,499    31,749    15,876
  NOT BIT PACKED         37,897    33,328    37,897    18,948     9,474

Figure 54.  Microprocessor Cycles for Benchmarks

Figure 54 shows the following:

a.  The Litton timing shows a small advantage over the Tracor timing because its
microinstruction repertoire requires fewer executions.

b.  The Raytheon timing appears superior to the Tracor/Litton timing on the
surface. Therefore, a detailed comparison follows.

c.  FFT: The Raytheon timing is significantly better than the Tracor/Litton timing. When
compared with an array architecture, i.e., equivalent or less hardware, the timing is about
equivalent.

d.  Coordinate Conversion/Polar to Rectangular: The array architecture timing is
competitive with Raytheon.

e.  Coordinate Conversion/Rectangular to Polar: The array architecture is significantly
slower than Raytheon. Raytheon's speed advantage is due to the built-in hardware
trigonometric special functions, which avoid data-dependent operations.

f.  CFAR: The Tracor/Litton bit packed algorithm is a factor of two slower than
the non-bit packed algorithm. The trade-off is speed vs storage requirement.
Comparing the non-bit packed Raytheon timing with the Tracor/Litton
timing shows about the same figures. However, an array architecture is much faster
than the Raytheon timing. A bit packed Raytheon algorithm would be even slower
because the algorithm necessitates data-dependent branches. This shows the Raytheon
architecture is not geared for this type of application.

g.  Special functions such as trigonometric hardware provide a speed advantage.

h.  A divide function, if required, would be slow in the Raytheon architecture.

i.  The divide function should be incorporated into the multiply chip.

j.  The importance of the Data Addressing function has been shown in the CFAR benchmark
and can be shown in the total FFT. Several index operations are performed
concurrently with Data Processing.

6.8.3  Conclusions

a.  A special purpose processor, i.e., a processor which has built-in trigonometric
functions, is faster than a general purpose processor.

b.  A special purpose processor is inflexible as compared with a general purpose processor,
e.g., for CFAR, divide, and non-first quadrant angles.

c.  The Raytheon computer is a type of array processor, i.e., 2 data paths in 2 pipe stages
= 4 processor equivalent.

d.  Array processors are faster than single processors because of parallel processing.

e.  Divide in the Raytheon computer is extremely slow. Square root, if necessary, has an
unknown implementation in the Raytheon computer.

f.  Raytheon should have feedback from the adder to the multiplier to improve its
ability to process the CFAR.

g.  The multiply chip should include divide to significantly speed up processing. How often
the divide is needed is unknown.

6.9  REFERENCES 

a.  Critical Item Development Specification for General Processor Unit, Performance/
Design Requirements and Technical Description, Appendix D, Preliminary CDR1.
E003. Prepared for Space and Missile Systems Organization, Air Force Systems
Command. Contract No. N0024-73-C-1131, Modification P00004, prepared by
Tracor, Inc., 6500 Tracor Lane, Austin, Texas, 78721, 23 April 1974, Document No.
T74-AU-9073-U.

b.  MC2901, Four-Bit Bipolar Microprocessor Slice, includes "Using the MC2901",
"MC2909 Microprogram Sequencer", and "Using the MC2909", by Motorola.

c.  Microprogramming ups your options in µP-System Design, Part 2, by Jim Brick and
John Mick, Advanced Micro Devices, EDN, February 5, 1978, pages 53-61.

d.  High Speed Micro Signal Processor Study, Final Report Draft. Prepared for Air
Force Avionics Laboratory, United States Air Force, Wright-Patterson AFB, Ohio.
Contract No. F33615-76-C-1339/CLIN:0003. Prepared by Raytheon Company,
Missile Systems Division, Bedford Laboratories, Bedford, Mass., January 1977,
BR-9633.

e.  Initial Briefing to WPAFB/AFAL for High Speed Micro Signal Processor. Contract
F33615-77-C-1224, AF Raytheon Company, Bedford, Massachusetts, 7 December
1977.

f.  Proposal for High Speed Microsignal Processor, Volume I, Technical. Submitted
to United States Air Force, Air Force Systems Command, Aeronautical Systems
Division/PPMEA, Wright-Patterson AFB, Ohio 45433. In response to solicitation
No. F33615-77-R-1224. Prepared by Data Systems Division, Litton Systems, Inc.,
8000 Woodley Avenue, Van Nuys, California 91409, 21 July 1977, MS 77357-1.

g.  Current Study Contract




SECTION  VII 

CONCLUSIONS  AND  RECOMMENDATIONS 


7.0  INTRODUCTION 

The attempt to design a single large scale integrated circuit, the Multimode CPU, has revealed
some interesting insights into digital signal processor design, LSI technology and the signal
processing problem set. Analysis has shown that the Data Processing, Data Addressing and
Instruction Address functions are all within the reach of LSI.

In this chapter, conclusions will be drawn and recommendations will be made in the areas of
the Benchmarks, the Architecture, the Technology, and the Comparison. The conclusions
presented herein are supported in the preceding chapters.

7.1  CONCLUSIONS 

7.1.1  Benchmarks 

The problem set in Section II was included to be representative of the signal processing tasks
required presently and through approximately 1990. The problem had to be bounded so
that the MMCPU could be designed to span a wide range of applications and still be
"specialized" enough to handle the unique requirements of the problem set. The tasks can be
separated into high speed and low speed requirements. The FFT, along with the weighted
FFT and cosine transform, are extremely high speed computation problems. The pulse
classification algorithm requires a high speed computation but, more importantly, a high speed
data dependent testing capability. The other problems are lower speed and will not be
discussed here.

The FFT and the classic Cooley-Tukey butterfly have a very orderly and repetitive arithmetic
flow and a simple addressing scheme. From the analysis of multiplier structures in Section III
it is concluded that the optimum processor structure for the FFT is a hardware special
function unit which performs all the butterfly arithmetic. Furthermore, because the
addressing requires simple additions and tests, a fairly simple, but high speed, general purpose RALU
or CPU is required to support the special function unit.

The pulse classification algorithm is virtually at the other end of the processing spectrum.
Although the operand set-up requires a repetitive set of adds and multiplies for calculating
the distance measure, the bulk of the processing (about 23 percent by actual operation
count) is involved in testing and selecting a branch path from the outcome of the test. It is
concluded that a very sophisticated, high speed, general purpose CPU is required with a
multiplier to support it.

The emphasis of the CPU and the special function unit (butterfly unit or multiplier) is
reversed. Part of the objective of this program was to determine if a single architecture could
accomplish this task. It is concluded that a single architecture would be somewhat inefficient,
but a single CPU structure capable of handling the pulse classification problem can be defined
which would be more than sufficient for the remaining tasks in the problem set.


7.1.2  The  Architecture 

To support the high I/O rate of the FFT butterfly, the addressing unit must supply addresses
to the data memory so that operands can be read or written at a high rate; therefore, the
bus system must support the FFT speed requirements. In Section III, the bus system was
defined as a result of the problem set and became the prime concern for supporting the
multipliers and RALUs. It is concluded from a top-down viewpoint that the processing system
should be built around the maximum bus necessary for the job. By defining the bus first,
the speed requirements for the processing elements or the speed limitations of the processor
imposed by the bus are clearly established.

After the bus requirements were established, the multiplier structure was investigated in
relation to the problem set. It is concluded that the processor structure is highly dependent on
the multiplier structure, as evidenced by the two DP/DA structures presented in Section III.
To maximize the "general purpose" goals of the MMCPU, the Multiplier/FFT
structure is preferred because it handles the high speed problems of the FFT, weighted FFT and
Cosine Transform, as well as the other problems in Section II.

Array processing is generally a difficult procedure because bus transfers, resource sharing, etc.
are difficult. The architecture presented was developed with a desire to expand via array
processing so that the processing speed could be increased. The limitations of bus transfers,
resource sharing, and timing were resolved by allowing only nearest neighbor
intercommunication and several I/O Data Ports; furthermore, the processors must be operated in lock-step
or time synchronism if the array approach is to work with maximum efficiency.

7.1.3  The  Architectural  Comparison 

Using the constraints of Section VI, several significant points can be concluded about the
Tracor/RCA processor (GPU), the Raytheon processor (Micro-Signal Processor), and the
Litton processor (MMCPU).

The GPU is similar to the 2901 processor. It is excellent for general purpose problems
including emulation. The RALU structure provides great flexibility of operation; however, the
limited number of I/O ports inhibits the bus interconnection flexibility necessary for signal
processing. Furthermore, the I/O limitations make array processing virtually impossible
because the data buses must be tied to each other, thereby forcing a battle for bus usage.

The Micro-Signal Processor is designed specifically for signal processing problems. It will
handle the FFT, weighted FFT, and Cosine Transform well because the pipeline structure is
oriented toward the FFT. It resembles the multiplier/accumulator with holding registers
discussed in Section III. The major weakness of the architecture is the lack of provision for
data dependent operations; therefore, EW problems are beyond its scope. Any data
dependent testing must be done in the sequencer, and the result must wait for the FIFO to clear
before it can be implemented. The primary problem is the overdependence on the pipeline
for all arithmetic.

The MMCPU is similar to the GPU and the 2901 RALUs; therefore, it is extremely capable
of the general purpose problems. It has enough ports available to give it signal processing
flexibility in a single or array configuration. The major limitation to this processor is the
multiplier. Current commercially available multipliers are not responsive to the FFT needs.
Without such a multiplier/FFT unit, this processor is greatly limited for the FFT type
problem; however, the EW problems are well within reach.

A final point should be made. The pipeline arithmetic of the Raytheon processor, along
with the MMCPU for general processing and data dependent operations, would be a powerful
processor configuration.

7.1.4  The  Technology  and  the  MMCPU 

In section 5.3, CMOS/SOS, I²L, and VMOS were analyzed, and each is capable of high gate
count LSI. Although the 1500 to 1750 gate 8-bit RALUs appear to be a limit for the
technologies, any of these technologies could perform well now or in the near future.

The final question remains. Is the MMCPU chip concept feasible? If so, in what context?

To answer this question, the gate count must be estimated. Using the RALUs as a basis,
the microsequencer and loop counter could be accomplished by using the MPR/ALU
functions of the RALUs. Approximately 200 additional gates would be necessary in the MPR so
that it could perform as a 16 word by eight bit register file and as an 8 word by 12 bit
LIFO stack.

The interrupt control unit, flag logic, and instruction addressing/instruction decode would
have to be included on the single chip. Another 500 gates would be necessary. Because the
IA mode is significantly different from either the DP or DA mode, additional gates would be
necessary to permit multiple modes at the I/O data ports as well as to allow 12 bit
instruction addresses to be generated instead of 8 bit data I/O. Lastly, the internal chip buses
would have to be structured to accommodate 8 bit data and 12 bit instruction addresses.

The total estimate for an MMCPU is between 2300 and 3000 gates.
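
The arithmetic behind this estimate can be retraced with a short sketch. It uses only the figures quoted in this section (the 1500 to 1750 gate RALU, the 200 MPR gates and the 500 control gates). The remainder attributed here to the multimode I/O ports and internal buses is an inference from the stated total, not a number given in the text, and the variable names are illustrative.

```python
# Gate-count bookkeeping for the single-chip MMCPU estimate.
ralu_gates = (1500, 1750)      # 8-bit RALU range quoted in section 7.1.4
mpr_extra = 200                # MPR additions for register file / LIFO stack use
control_extra = 500            # interrupt control, flag logic, IA / decode
stated_total = (2300, 3000)    # total estimate quoted in the text

# Sum of the explicitly itemized contributions.
subtotal = tuple(g + mpr_extra + control_extra for g in ralu_gates)

# Whatever is left is assumed to cover the multimode I/O ports and the
# 8 bit / 12 bit internal buses (an inference, not a stated figure).
implied_multimode = tuple(t - s for t, s in zip(stated_total, subtotal))

print("itemized subtotal:", subtotal)
print("implied multimode allowance:", implied_multimode)
```

The itemized gates alone fall short of the stated range, which is consistent with the text's remark that further gates are needed for the IA mode.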

From  a commercial  point  of  view,  the  single  chip  is  inefficient,  requiring  a significant  amount 
of  the  chip  to  be  unused  in  various  modes.  Unused  portions  of  a chip  are  costly  because 
when  the  unused  portions  of  the  chip  are  stripped  away,  the  chip  is  smaller,  the  yield  is 
higher  and  the  cost  is  lower.  However,  from  a military  point  of  view,  a single  chip  type  may 
offset  the  cost  of  unused  portions  of  the  chip.  A single  chip  type  reduces  the  number  of 
types  that  must  be  supplied.  Lower  life  cycle  cost  can  be  aided  by  such  reductions. 

A single chip might be advisable to the military; unfortunately, the problem set requires that
high speed gates be utilized to perform the FFT and EW problems. As previously discussed,
higher speed means higher gate dissipation. Power is the major limitation. The total chip
power dissipation would be greater than 3 Watts for any of these technologies. The only
way the MMCPU chip would be feasible is if the unused functions on the chip were not
powered. Such a scheme is possible, but generally not practical.

It is, therefore, concluded that the MMCPU chip concept is not feasible in today's
technology. A two chip type system (one a DP/DA RALU chip, the other an IA controller chip)
would satisfy the needs of the complex processors discussed in Section III.




At  the  present  rate  of  increase  in  technology,  the  single  chip  concept  remains  two  to  three 
years  away  for  high  speed  applications.  Lower  speed  chips  are  possible  today  which  may 
mean  that  a lower  speed  version  could  be  developed  and  utilize  the  array  processing  concept 
to  perform  the  higher  speed  problem. 

7.2  RECOMMENDATIONS 

7.2.1  Electronic  Warfare 

The Electronic Warfare problem was briefly discussed in Section II as one of the benchmarks
for the MMCPU. The ultimate solution was not presented and is not totally known. The
solution given herein is a fairly simple-minded approach, assuming all the emitter data
parameters are the same word length. In truth, the word lengths are greatly different, and
weighting would be necessary to "standardize" the word lengths for processing.

The pulse classification algorithm has a very repetitive distance-measure calculation using the
parameters. Array processing should be explored as a means of greatly increasing the speed
of the simple calculation via parallelism. A possible solution is simple hardware for the
calculation and an MMCPU for probability comparisons. More study of architectures is needed
in this area.

7.2.2  Array  Processing 

Array processing has been presented herein in a very limited manner. The approach given is
a cross between the full parallel processor and the multiprocessor. The major area of
application for the nearest-neighbor approach is very structured problems such as signal processing.
The concept needs more study in two areas. 1) Slower processors are possible if more
arraying can be efficiently done. 2) Processor speed increases may be possible without
obviating software for the slower processor because one processor simply works twice as fast as its
predecessor, thereby doing the work of two. Both areas seem quite fruitful.

7.2.3  Demonstration  of  MMCPU 

A practical demonstration of the MMCPU, to demonstrate the concept and to study the
arraying possibilities in EW and other applications, is necessary to "prove the concept." The
demonstration processor could be built of 2900 series parts very easily because the MMCPU is
very similar to the RALUs and microsequencers of that series. The necessary speed could not
be simulated but the function could be proven.

7.2.4  VMOS  Technology 


Lastly, the VMOS technology should be carefully watched as a potential LSI signal processing
technology. Although it is an NMOS variation, many of the temperature range problems seem
to be ameliorated. With the possibility of minuscule channel lengths using standard geometry
rules, this technology has the potential of outdistancing every available technology in the LSI
field.

APPENDIX A


TIMING  BENCHMARKS  FOR  COMPLEX  PROCESSORS 

TASK: PERFORM BUTTERFLY FOR FFT.

Algorithm:  A and B are input points
            X and Y are output points

TR = BR*CO - BI*SI
TI = BR*SI + BI*CO
XR = AR + TR
XI = AI + TI
YR = AR - TR
YI = AI - TI
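
The six equations above can be checked against ordinary complex arithmetic with a minimal sketch. The function name `butterfly` and the sample operand values are illustrative assumptions. The sketch works in floating point and does not model the word length or scaling of the processors under study.

```python
def butterfly(ar, ai, br, bi, co, si):
    """One radix-2 butterfly on separate real and imaginary parts."""
    tr = br * co - bi * si      # real part of B rotated by (CO, SI)
    ti = br * si + bi * co      # imaginary part of B rotated by (CO, SI)
    xr = ar + tr
    xi = ai + ti
    yr = ar - tr
    yi = ai - ti
    return (xr, xi), (yr, yi)


# Cross-check against Python's built-in complex type.
a = complex(1.0, 2.0)
b = complex(3.0, -1.0)
w = complex(0.6, 0.8)           # rotation vector CO + j*SI
(xr, xi), (yr, yi) = butterfly(a.real, a.imag, b.real, b.imag, w.real, w.imag)
x_ok = abs(complex(xr, xi) - (a + b * w)) < 1e-12
y_ok = abs(complex(yr, yi) - (a - b * w)) < 1e-12
print(x_ok, y_ok)
```

Written this way, each butterfly costs four real multiplies and six real additions, which is the operation mix the timing listings in this appendix count.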
Alternate Algorithm:

TR1 = BR*CO      XR1 = AR + TR1      YR1 = AR - TR1
TR2 = BI*SI      XR  = XR1 - TR2     YR  = YR1 + TR2
TI1 = BR*SI      XI1 = AI + TI1      YI1 = AI - TI1
TI2 = BI*CO      XI  = XI1 + TI2     YI  = YI1 - TI2

PROCESSOR 1


Task 1: FFT Butterfly

Coding:                         Timing (in cycles)

R1   = M(B)
R2   = M(C)                     (2) if necessary
R3R  = R1R*R2R                  2
R3I  = R1R*R2I
R4R  = R1I*R2I
R4I  = R1I*R2R
R3   = R3 + R4                  1
R0   = M(A)
M(X) = R0 + R3                  2
M(Y) = R0 - R3                  2
                                17 (19)

Total Timing:

4097 Butterflies require loading of A and B
1023 Butterflies require loading of A, B, and C, the new rotation vector

Total Cycles = 17 X 4097 + 19 X 1023
             = 89086 cycles

PROCESSOR  2 

Task 1: FFT Butterfly
Coding:

MPYL1 = M(C)
MPYL2 = M(B)
MPYL3 = M(A)

COMPLEX MPY = L1*L2
COMPLEX ADD   L3+CMPY
              L3-CMPY

M(X) = CADD
M(Y) = CSUB


Timing  (in  cycles) 

(1)  if  necessary 

1 

1 

1 


8/9 


Total  Timing: 

4097 Butterflies require loading of A and B

1023 Butterflies require loading of A, B, and C, the new rotational vector
Total Cycles = 8 X 4097 + 9 X 1023
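
The two total-timing expressions in this appendix follow the same pattern, so a small sketch can evaluate both. The function and parameter names are illustrative. The cycle counts per butterfly (17/19 for Processor 1, 8/9 for Processor 2) and the butterfly counts (4097 and 1023) are taken from the listings above.

```python
def total_cycles(cycles_ab, cycles_abc, n_ab=4097, n_abc=1023):
    """Cycles for a full FFT: butterflies that load only A and B, plus
    butterflies that also load a new rotation vector C."""
    return cycles_ab * n_ab + cycles_abc * n_abc


processor_1 = total_cycles(17, 19)
processor_2 = total_cycles(8, 9)
total_butterflies = 4097 + 1023   # 512 butterflies in each of 10 passes

print("Processor 1:", processor_1, "cycles")
print("Processor 2:", processor_2, "cycles")
print("Butterflies:", total_butterflies)
```

On these figures Processor 2 needs a little under half the cycles of Processor 1 for the same 5120 butterflies.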
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TASK:  PULSE  CLASSIFICATION 


Algorithm:

ERROR    R0  = M(BI)                                          get bi
         R1  = M(BJI)                                         get bji
         R2  = R0 - R1
         R2  = R2*R2                                          (bi-bji)^2
         R2  = M(SJI)*R2                                      (bi-bji)^2
         R3  = R3 + R2: I0 = I0 + 1: JUMP ERROR IF I0 NE I1
         R4  = R3 shift right                                 EJ = -EJ/2
         R5  = M(R4)                                          Memory look-up exp (R3)
                                                              [Could calculate the exponential]
         R5  = R5*M(SPJ)                                      M(SPJ) = Pj
         ALU = R5 - R6: JUMP JCOUNT IF ALU 0
         R6  = R5: M(I5) = I2, I5 = I5 + 1                    Store Cj
JCOUNT   JUMP THRES IF I2 = JMAX
         I2  = I2+1: JUMP ERROR
THRES    ALU = R5 - M(T); PC = PC+2 IF ALU 0                  M(T) is the threshold
         JUMP ERROR

PROCESSOR I AND II


Task: Pulse Classification

Coding:                                                   Timing (in cycles)

         CLR MPR                                          1
         I0 = 0
         I1 = IMAX
         I2 = JSTART                                      1
         I3 = JMAX
         I4 = JDELTA                                      1
         I6 = THRESHOLD                                   1
                                                          7

ERROR    R0 = M(I0) : I5 = 2*I0
         R1 = M(I5 + I2+1)                                2
         R2 = R0 - R1
         R2 = R2*R2
         R2 = R2*M(I5 + I2+1)                             3
         R3 = R3 + R2; I0 = I0 + 1: JUMP ERROR
              IF I0.NE.I1                                 1/2
                                                          11/12

         I5 = -R3 Shift Right
         R2 = M(I5)
         R2 = R2*M(I2)                                    3
         ALU = R5-R6 : JUMP JCOUNT IF ALU 0               2/3
                                                          9/10

         R6 = R5 : I6 = I2
JCOUNT   JUMP THRES IF I2 = IMAX
         : I2 = I2                                        2
                                                          5/6

THRES    ALU = R5 - M(I2 + I6) : SKIP IF ALU 0            2/3
         JUMP ERROR                                       1
                                                          3/4

STORAGE ROUTINE   END OF ALGORITHM

TOTAL TIMING:


SETUP - 7 cycles

CLASSIFICATION
    Assuming 100 possible classes

ERROR ROUTINE
    Assuming 4 parameters to calculate distance
    4 total passes per class
        3 require jumps
        1 requires no jump
    Total per class = 11 x 1 + 12 x 3 = 47
    TOTAL = 100 X 47 = 4700 cycles

PROBABILITY ROUTINE
    Each class requires 14/13 cycles depending on jumps.
    Assuming 50% require jumps
    Final pass requires 2 jumps
    TOTAL = 14 x 50 + 13 x 50 + 1 = 1351 cycles

THRESHOLD ROUTINE
    Entered only once per classification
    TOTAL = 3 cycles

TOTAL = 7 + 4700 + 1351 + 3 = 6061 cycles
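
The pulse classification total can be retraced step by step. The sketch below uses only the assumptions stated in the timing summary (100 classes, 4 parameters per class, half of the probability passes taking the jump). The variable names are illustrative.

```python
SETUP = 7                      # cycles, executed once
CLASSES = 100                  # assumed number of possible classes

# Error routine: 4 passes per class, 3 with the jump (12 cycles each)
# and 1 without the jump (11 cycles).
error_per_class = 11 * 1 + 12 * 3
error_total = CLASSES * error_per_class

# Probability routine: 14 or 13 cycles per class, split evenly,
# plus one extra cycle because the final pass takes two jumps.
probability_total = 14 * 50 + 13 * 50 + 1

THRESHOLD = 3                  # entered once per classification

total = SETUP + error_total + probability_total + THRESHOLD
print(error_per_class, error_total, probability_total, total)
```

The error routine accounts for roughly three quarters of the total, so the distance calculation dominates the cycle count under these assumptions.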


APPENDIX  B 


BENCHMARKS  CODING 


1 FFT, Tracor/Litton

2 FFT, Raytheon

3 Coordinate Conversion A, Tracor/Litton
  Polar to Rectangular

4 Coordinate Conversion A, Raytheon
  Polar to Rectangular

5 Coordinate Conversion B, Tracor/Litton
  Rectangular to Polar

6 Coordinate Conversion B, Raytheon
  Rectangular to Polar

7 CFAR, bit packed, Tracor/Litton

8 CFAR, non bit packed, Raytheon

B1 FFT, Tracor/Litton Coding

1.  Benchmark: FFT 1024 Points

2.  Algorithm

TR = CO*BR - SI*BI
TI = SI*BR + CO*BI

XR = AR + TR
XI = AI + TI

YR = AR - TR
YI = AI - TI

3.  Coding

Labels    Instructions:              Comments

          R1    = M(CO)
          R5    = R1*M(BR)
          R2    = M(SI)
          R6    = R2*M(BI)
          R6    = R5 - R6            TR
          R5    = R2*M(BR)
          R7    = R1*M(BI)
          R7    = R5 + R7            TI
          R3    = M(AR)
          R4    = M(AI)
          M(XR) = R3 + R6
          M(XI) = R4 + R7
          M(YR) = R3 - R6
          M(YI) = R4 - R7

4.  Remarks:

a.  DA operation is transparent to DP operation

b.  Coding is for one butterfly

c.  For 1,024 Point FFT there are 10 passes

    Pass      Number of CO/SI      Number of Butterflies/CO,SI

    1         1                    512
    2         2                    256
    3         4                    128
    4         8                    64
    ...
    10        512                  1

d.  A,B are inputs of butterfly
    X,Y are outputs of butterfly

    Second pass uses the following:

    For (CO,SI)1:   A1 = X1  and  B1 = X2
                    A2 = X3       B2 = X4    etc.

    For (CO,SI)2:   A1 = Y1       B1 = Y2
                    A2 = Y3       B2 = Y4    etc.

    This requires multiple indexing for operand access.

e.  The reordering of final results is not included because Data Addressing can perform
this operand access in reversed order with no penalty.
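
The pass structure in remark c can be generated for any power-of-two transform length. The sketch below is an illustration only: the function name `fft_pass_table` is an assumption, and it models the bookkeeping of rotation vectors and butterflies, not the Data Addressing hardware.

```python
def fft_pass_table(n_points):
    """Return (pass, number of CO/SI pairs, butterflies per CO/SI pair)
    for a radix-2 FFT whose length is a power of two."""
    passes = n_points.bit_length() - 1   # log2(n_points) for a power of two
    half = n_points // 2                 # butterflies per pass
    table = []
    for p in range(1, passes + 1):
        n_rot = 2 ** (p - 1)             # distinct rotation vectors in pass p
        table.append((p, n_rot, half // n_rot))
    return table


table = fft_pass_table(1024)
total_butterflies = sum(n_rot * per_rot for _, n_rot, per_rot in table)
rotation_loads = sum(n_rot for _, n_rot, _ in table)

for row in table:
    print("%4d %8d %8d" % row)
print(total_butterflies, rotation_loads, total_butterflies - rotation_loads)
```

For 1024 points the totals are 5120 butterflies, of which 1023 begin with a new rotation vector and the remaining 4097 do not, matching the butterfly counts used in the Appendix A timing.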


B2  FFT, RAYTHEON CODING

1.  Benchmark: FFT 1024 Points

2.  Algorithm:

XR = BR + AR*CR - AI*CI
XI = BI + AR*CI + AI*CR
YR = BR - AR*CR + AI*CI
YI = BI - AR*CI - AI*CR

3.  Coding:

Data In   Scaling   Multiply       Accumulate                       Data Out   Comments

AR  BR              M1 = AR*CR     S1 = BR + M1    XR = S1 - M2     XR         Repeat 512 X 10 PASS
AI  BI              M2 = AI*CI     S2 = BR - M1    YR = S2 + M2     YR
CR  CI              M3 = AR*CI     S3 = BI + M3    XI = S3 + M4     XI
                    M4 = AI*CR     S4 = BI - M3    YI = S4 - M4     YI         Flush

4.  Remarks:

a.  Coding is for one butterfly.

b.  The reordering of final results is not included.

B3  COORDINATE CONVERSION A, TRACOR/LITTON CODING


1.  Benchmark: Polar to Rectangular

2.  Algorithm:

X = R cos θ
Y = R sin θ

3.  Coding:

Labels  Instructions:  Comments

RO 

= 

0 

quadrant  indicator 

R1 

= 

M(0) 

R2 

= 

tt/2  - R1 

tt/2  - 9 

ALU 

= 

R1  - v/2 

IF  < 0 JUMP  SIN 

8 - n 12 

Rl  = n/2  - 9 

RO 

= 

RO  + 1 

R2 

= 

-R1 

9 -n/2 

R1 

= 

t r-  M(0) 

tt-8 

ALU 

=r 

-R1 

IF  < 0 JUMP  SIN 

8 - n 

Rl  = 9 -v/2 

RO 

= 

RO  + 1 

R1 

= 

R1  - tt/2 

8 • w 

R2 

5= 

3 tt/2  - M(0) 

3 tt/2  -0 

ALU 

= 

-R2 

IF  < 0 JUMP  SIN 

0-  3 rr/2 

Rl  = 3tt/2  -0 

RO 

sr 

RO  + 1 

R2 

= 

-R 1 

0 - 3v/2 

R 1 

ac 

2tt  - M(0) 

RE 

= 

2 

Set  up  2 passes 

R1 

= 

R1*R1 

02 

R3 

a 

M(K4)  + R1 

zo 

MPYB 

= 

ri  i 

Litton: 

R3 

= 

MPYB*R+  1 

R3  = R 1 *R3 

R3 

as 

R3  + M(K3) 

Zl 

MPYB 

s 

Rl  ) 

Litton: 

R3 

as 

MPYB*R3 f 

R3 = R1*R3 

R3 

as 

R3  + M(K2) 

z2 

MPYB 

a: 

Rl  \ 

Litton: 

R3 

as 

MPYB*R3  i 

R3 = R1*R3 

R3 

s 

R3  + M(K1) 

Z3,  (sin)  cos 

RE 

K 

RE  - 1 

IF  ZERO  JUMP  SIGN 

R4 

e 

R3 

Save  sin  0 

R1 

sc 

R2 

JUMP  MP 

for  cos 
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Labels 


Instructions: 


Comments 


1 


SIC.N : 


POL: 


ALU 

= 

RO 

IF  ZERO  JUMP  POL 

R3 

= 

-R3 

cos  = - sin 

RO 

= 

RO  - 1 

IF  ZERO  JUMP  POL 

R4 

= 

-R4 

sin  = - sin 

RO 

= 

RO  - 1 

IF  ZERO  JUMP  POL 

R3 

= 

-R3 

Litton: 

cos  = sin 

MPYB 

= 

M(R) 

) R5  = M(R) 

M(X) 

= 

R3*MPYB 

[ M(X)  = R5*R3 

M(Y) 

END 

— 

R4*\1PYB 

) M(Y)  = R5*R4 

4.  Remarks: 

a.  π/2 etc. are constants, literal operands

b.  Sine subroutine uses approximation

c.  Cosine uses second pass through sine subroutine

d.  Coding included quadrant determination of θ.
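
The approach summarized in these remarks (quadrant determination, a polynomial sine approximation, and cosine obtained by a second pass through the sine routine) can be sketched as follows. This is not a transcription of the listing above. The coefficients are Taylor-series values chosen for illustration because the constants K1 to K4 are not given in the text, the quadrant reduction is done by subtraction of multiples of π/2 rather than by the tests in the coding, and all function names are assumptions.

```python
import math

# Coefficients of sin(t)/t as a polynomial in t*t, highest power first.
# Taylor values, used here only as a stand-in for the report's constants.
COEFFS = (1.0 / 362880.0, -1.0 / 5040.0, 1.0 / 120.0, -1.0 / 6.0, 1.0)


def sin_first_quadrant(t):
    """Approximate sin(t) for 0 <= t <= pi/2 by Horner's rule in t*t."""
    t2 = t * t
    acc = COEFFS[0]
    for c in COEFFS[1:]:
        acc = acc * t2 + c
    return acc * t


def polar_to_rect(r, theta):
    """Return (x, y) for 0 <= theta < 2*pi."""
    quadrant = min(int(theta // (math.pi / 2)), 3)   # quadrant indicator 0..3
    t = theta - quadrant * (math.pi / 2)             # reduced angle
    s = sin_first_quadrant(t)
    c = sin_first_quadrant(math.pi / 2 - t)          # cosine: second sine pass
    if quadrant == 0:
        sin_t, cos_t = s, c
    elif quadrant == 1:
        sin_t, cos_t = c, -s
    elif quadrant == 2:
        sin_t, cos_t = -s, -c
    else:
        sin_t, cos_t = -c, s
    return r * cos_t, r * sin_t


print(polar_to_rect(2.0, 2.5))
```

With these coefficients the approximation error stays below about 4e-6 over the first quadrant, so the result is limited mainly by the polynomial rather than by the quadrant handling.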

B4  COORDINATE  CONVERSION  A,  RAYTHEON  CODING 

1.  Benchmark: Polar to Rectangular Coordinate Conversion

2.  Algorithm:

X = R cos θ
Y = R sin θ

3.  Coding

Data In    Scaling     Multiply            Accumulate    Data Out    Comments

θ1         Cos θ1      X1 = R1 Cos θ1                    X1
θ2         Cos θ2      X2 = R2 Cos θ2                    X2
R1         Sin θ1      Y1 = R1 Sin θ1                    Y1
R2         Sin θ2      Y2 = R2 Sin θ2                    Y2

4.  Remarks:

Coding does not include quadrant determination of θ.

B5  COORDINATE CONVERSION B, TRACOR/LITTON CODING

1.  Benchmark: Rectangular to Polar

2.  Algorithm:

R = sqrt(X^2 + Y^2)

θ = sin^-1 (|Y| / R)

θ = π/2 - sin^-1 (|X| / R)


Special  Purpose  Divide  Algorithm: 

Z = X/Y     Registers R1/R2 = R0

Z/2 = X(-Y/2 - 1) + (-Y - 1)^2 (Z/2)

Assume user provides test that |X| ≤ |Y|


(1)   If X = 0               then   Z = 0 (even if Y = 0)           . END
(2)   If Y = 0 and X > 0     then   Z = max                         . END
(3)   If Y = 0 and X < 0     then   Z = - max                       . END
(4)   If Y = - max           then   Z = Y + 1                       . END
(5)   If Y > 0               then   complement Y and X
(6)   If Y > -0.5            then   shift left, "count" until in range
(7)   Kx = X(-Y/2 - 1)       note Y < 0
(8)   Ky = (-Y - 1)^2
(9)   Z0 = Kx
(10)  n = 0                  set iteration counter
(11)  Zn+1 = Kx + Ky Zn
(12)  n = n+1                If n ≠ 7 JUMP 11
(13)  Z = 2 Zn
(14)  Z shift right until "count" (6 above) = 0
(15)  END

When using index registers, then

R1 = M(I1)

R2 = M(I2)

M(I3) = R0
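
The core of the special purpose divide is the fixed-point iteration in steps (7) through (13). The sketch below covers only that core, under assumptions that should be read as such: the divisor is taken to be already negative and normalized to -1 <= Y <= -0.5, and the constants are taken to be Kx = X(-Y/2 - 1) and Ky = (1 + Y)^2, values for which the iteration Z = Kx + Ky*Z converges to X/(2Y). The special cases and the shift "count" handling of steps (1) through (6) and (14) are omitted, and the arithmetic is floating point rather than the fractional word format of the processors.

```python
def divide_core(x, y, iterations=7):
    """Approximate x / y for a normalized negative divisor, -1 <= y <= -0.5,
    using the iteration Z(n+1) = Kx + Ky * Z(n) followed by a doubling."""
    if not -1.0 <= y <= -0.5:
        raise ValueError("divisor must be normalized to [-1, -0.5]")
    kx = x * (-y / 2.0 - 1.0)      # assumed form of Kx
    ky = (1.0 + y) ** 2            # assumed form of Ky, at most 0.25
    z = kx                         # Z0 = Kx
    for _ in range(iterations):    # seven passes through step (11)
        z = kx + ky * z
    return 2.0 * z                 # Z = 2 * Zn


quotient = divide_core(0.3, -0.8)
print(quotient)
```

Because Ky never exceeds 0.25 on the normalized range, each pass gains at least two bits, and seven passes leave a relative error of at most 0.25 to the eighth power, about 1.5e-5. That is one plausible reading of why the iteration count is fixed at seven.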

3.  Coding 

Labels   Instructions:                                Comments

         R0    = M(X)                                 Calculation of R
         R0    = R0*R0                                Litton: R0 = M(X)*M(X)
         R1    = M(Y)
         R1    = R1*R1                                Litton: R1 = M(Y)*M(Y)
         R1    = R1 + R0
         CALL SQ
         M(R)  = R0
         END

SQ:      M(T)  = R2                                   Save R2
         R0    = 0                                    R0 = √R1
         R2    = R1'           IF EQUAL JUMP SQOT     Check for 0
         ALU   = R2            IF NEG JUMP POS        Check for NEG
         R1    = R2 + 1                               2's COMPL.
POS:     R2    = 0                                    Scaling
         R0    = R1 - 0.5625   IF NEG JUMP SQ1        If < 0.5625
         R2    = R2 - 1
         R0    = R0 SHR 2      JUMP SQ4
SQ1:     R0    = R0 + 0.5      JUMP SQ3
SQ2:     R1    = R1 SHL 1
         R1    = R1 SHL 1                             SHL 2
         R2    = R2 + 1                               Scaling
         R0    = R1 - 0.0625
SQ3:     ALU   = R0            IF NEG JUMP SQ2        If < 0.0625
SQ4:     M(U)  = R2                                   Save Scaling
         R2    = -13                                  Iter count
         M(V)  = R1                                   Save Value
         R0    = 0.56                                 Approx

Labels   Instructions:                                Comments

SQ5:     R1    = R0 × R0                              Z²
         R0    = R0 - R1                              Z - Z²
         R0    = M(V) + R0                            + N
         R2    = R2 + 1        IF NEG JUMP SQ5
         R2    = M(U)          IF NEG JUMP SQ7        Scaling
         ALU   = R2            IF ZERO JUMP SQOT
         R2    = R2 - 1
SQ6:     R0    = R0 SHR 1
         R2    = R2 - 1        IF ZERO JUMP SQ6
                               JUMP SQOT
SQ7:     R0    = R0 SHL 1
SQOT:    R2    = M(T)          RETURN

         M(θ)  = 0                                    Calculation of θ
         R2    = M(R)          IF ZERO JUMP END       Center
         R0    = 0
         R1    = M(Y)          IF ZERO JUMP QUAD
         R3    = R1'           IF NEG JUMP POS
         R1    = R3 + 1
POS:     CALL DIV                                     R1/R2 = R0
         ALU   = R0 - √2/2     IF LESS JUMP C2        > π/4
         CALL ARC SIN                                 R0 = ANG
QUAD:    R1    = M(X)          IF NEG JUMP Q2
         R2    = M(Y)          IF NEG JUMP Q4
         M(θ)  = R0            JUMP END               + +  I
Q2:      R2    = M(Y)          IF NEG JUMP Q3
         M(θ)  = π - R0        JUMP END               - +  II
Q3:      M(θ)  = π + R0        JUMP END               - -  III
Q4:      M(θ)  = 2π - R0                              + -  IV
         END
C2:      R1    = M(X)          IF ZERO JUMP C3
         R3    = R1'           IF NEG JUMP POS
         R1    = R3 + 1
POS:     CALL DIV
         CALL ARC SIN
C3:      R0    = π/2 - R0
                               JUMP QUAD

DIVIDE:                                               R0 = R1/R2
         R0    = R1            JUMP END IF 0          X = 0
         R4    = M(I2)
         ALU   = R2            IF NOT ZERO JUMP D2    Y ≠ 0

         ALU   = R1            IF NEG JUMP D1         Y = 0
         R0    = MAX           JUMP END
D1:      R0    = -MAX          JUMP END
D2:      R0    = R1' + 1                              -X
         R2    = R2' + 1       IF OV JUMP END         Test Y = -MAX
         ALU   = R2            IF NEG JUMP D3
         R1    = R1' + 1                              For NEG Y
         R2    = R2' + 1
D3:      M(S)  = R3
         R3    = 1
D4:      ALU   = R2 + 0.4999   IF NEG JUMP D5         Test Y > -.5
         R3    = R3 + 1                               Bring in range
         R2    = R2 SHL 1      JUMP D4
D5:      M(X)  = R4
         M(Y)  = R5
         R4    = R2 SHR 1                             -Y/2
         R4    = R4 + (-MAX)                          -1 - Y/2
         MPYB  = R1                                   Litton:
         R4    = R4*MPYB                              R4 = R4*R1        KX
         R5    = R2 + (-MAX)                          -1 - Y
         R5    = R5*R5                                ( )² = KY
         R0    = R4
         M(N)  = R6
         R6    = 7                                    Counter
D6:      MPYB  = R5                                   Litton:
         R0    = R0*MPYB                              R0 = R0*R5
         R0    = R0 + R4
         R6    = R6 - 1        IF NOT ZERO JUMP D6
         R0    = R0 SHL 1                             2Z
D7:      R3    = R3 - 1        IF ZERO JUMP D8
         R0    = R0 SHR 1      JUMP D7
D8:      R3    = M(S)
         R4    = M(X)
         R5    = M(Y)
         R6    = M(N)
END:     M(I3) = R0

ARC SIN: R0    = R0*R0
         R1    = M(K4) + R0
         MPYB  = R0                                   Litton:
         R1    = MPYB*R1                              R1 = R0*R1
         R1    = R1 + M(K3)
         MPYB  = R0                                   Litton:
         R1    = MPYB*R1                              R1 = R0*R1
         R1    = R1 + M(K2)
         MPYB  = R0                                   Litton:
         R1    = MPYB*R1                              R1 = R0*R1
         R1    = R1 + M(K1)
         RETURN
4.  Remarks:

a.  Arc sine subroutine uses the same approximation algorithm as sine but different K values.

b.  Divide subroutine uses an approximation.

c.  Coding includes quadrant determination.

d.  Angle θ is calculated and not determined by table look-up.
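
The square root iteration (Z replaced by Z - Z² + N, 13 passes from a starting value of 0.56) and the quadrant logic of the coding can be sketched in Python as follows. This is an illustration under stated assumptions, not the benchmark program: it uses floating point, the function names are hypothetical, `math.asin` stands in for the polynomial arc sine subroutine (whose K values are not given here), and the choice of which ratio to use on either side of √2/2 is an assumption of the sketch.

```python
import math


def sqrt_iter(n, iterations=13, z0=0.56):
    """Iterate Z <- Z - Z*Z + N; the fixed point is sqrt(N)."""
    z = z0
    for _ in range(iterations):
        z = z - z * z + n
    return z


def scaled_sqrt(n):
    """Scale N by powers of 4 into [0.0625, 0.5625), iterate, rescale."""
    if n <= 0.0:
        return 0.0
    scale = 1.0
    while n >= 0.5625:
        n /= 4.0
        scale *= 2.0
    while n < 0.0625:
        n *= 4.0
        scale /= 2.0
    return sqrt_iter(n) * scale


def rect_to_polar(x, y):
    """Return (R, theta) with theta in [0, 2*pi)."""
    r = scaled_sqrt(x * x + y * y)
    if r == 0.0:
        return 0.0, 0.0
    s = min(abs(y) / r, 1.0)
    if s < math.sqrt(2.0) / 2.0:
        angle = math.asin(s)            # asin stands in for ARC SIN
    else:
        angle = math.pi / 2.0 - math.asin(min(abs(x) / r, 1.0))
    if x >= 0 and y >= 0:
        theta = angle                   # quadrant I
    elif x < 0 and y >= 0:
        theta = math.pi - angle         # quadrant II
    elif x < 0:
        theta = math.pi + angle         # quadrant III
    else:
        theta = 2.0 * math.pi - angle   # quadrant IV
    return r, theta
```

With N scaled into [0.0625, 0.5625), the root lies between 0.25 and 0.75, so each pass at least roughly halves the error, which is why a fixed count of 13 passes is sufficient in this sketch.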

B6  COORDINATE CONVERSION B, RAYTHEON CODING

1.  Benchmark:  Rectangular  to  Polar  Coordinate  Conversion 

2.  Algorithm: 

θ = f(X, Y)      X = R cos θ      Y = R sin θ

R = X cos θ + Y sin θ = R cos²θ + R sin²θ

3.  Coding: 


4.  Remark

Angle θ is determined by table look-up via the hardware in the scaling stage.
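
The identity R = X cos θ + Y sin θ can be checked with a few lines of Python. In this sketch `math.atan2` merely stands in for the hardware table look-up of θ, and the function name is hypothetical.

```python
import math


def rect_to_polar_projection(x, y):
    """Return (R, theta) using R = X cos(theta) + Y sin(theta)."""
    theta = math.atan2(y, x)  # assumption: replaces the table look-up
    r = x * math.cos(theta) + y * math.sin(theta)
    return r, theta
```

The appeal of this form is that, once θ is available, R needs only two multiplies and one accumulate and no square root.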


B7  CFAR, BIT PACKED, TRACOR/LITTON CODING

1.  Benchmark:  Sliding  window  CFAR 

2.  Algorithm: 


a.  $S1 = \sum_{i=0}^{256} X_i$

b.  T1 = S1 × K/257

c.  D1 = X128 - T1;  pack into 16 bit word

d.  S2 = S1 + X257 - X0

Repeat b - d

[Figure: sliding window over the range cells, with indices I1, I4, and I2 into the range cells, I5 and I6 into the decisions, and I3 into the result.]

I1 - I5 are indices


3.  Coding: 


Labels   Instructions:                                                  Comments

CF0:     R0    = 0                                                      For Sum
CF1:     I1    = 256
         R0    = M(I0 + I1)     I1 = I1 - 1     IF NOT ZERO JUMP CF1
         I1    = 0
         I2    = 257
         I3    = 0
         I4    = 128
         I5    = 0
CF2:     I3    = I3 + 1         IF I5 ≠ 0 JUMP CF3
         ALU   = 257 - I3       IF ZERO JUMP END
         I5    = 16
         R1    = 0                                                      Temp
         R2    = 1                                                      Bit Mask

4.  Remarks

a.  Threshold K/257 is a constant, not a literal.

b.  Range cells are 6 bit unsigned quantities.

c.  256 cells of 6 bits, accumulated, require a 14 bit accumulator.
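
A compact Python sketch of the bit-packed sliding-window algorithm (steps a through d) follows. It is an illustration only: the function name and parameters are hypothetical, floating-point arithmetic replaces the fixed-point constant K/257, and the bit polarity (a 1 when the cell under test exceeds the threshold) is an assumption of the sketch.

```python
def cfar_bit_packed(cells, k, window=257, word_bits=16):
    """Return threshold decisions packed word_bits per word (illustrative)."""
    half = window // 2
    total = sum(cells[:window])              # step a: initial window sum
    words, word, mask = [], 0, 1
    for start in range(len(cells) - window + 1):
        threshold = total * k / window       # step b
        if cells[start + half] > threshold:  # step c: test the center cell
            word |= mask
        mask <<= 1
        if mask == 1 << word_bits:           # word full: store and restart
            words.append(word)
            word, mask = 0, 1
        if start + window < len(cells):      # step d: slide the window
            total += cells[start + window] - cells[start]
    if mask != 1:
        words.append(word)                   # store a partially filled word
    return words
```

Updating the sum by one add and one subtract per cell, rather than re-summing the window, is what keeps the per-cell cost constant.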

B8  CFAR, NON BIT PACKED, RAYTHEON CODING

1.  Benchmark:  Sliding  window  CFAR 

2.  Algorithm: 

$S1 = \sum_{i=1}^{256} X_i$

S2 = S1 + X257 - X1

T1 = K*S1

D1 = X127 - T1

T2 = K*S2

D2 = X128 - T2

3.  Coding

Labels   Instructions:                                    Comments

CF3:     R3    = R0*M(K)                                  T = K*Sum
         R3    = R3 - M(I0 + I4)     I4 = I4 + 1          T - X
                                     IF NEG JUMP CF4
         R1    = R3 and R2
CF4:     R2    = SHL 1               I5 = I5 - 1
         R0    = R0 + M(I0 + I2)     I2 = I2 + 1          Sum + B
         R0    = R0 - M(I0 + I1)     I1 = I1 + 1          Sum - A
                                     JUMP CF2
         END

4.  Remarks 

a.  Window  contains  an  even  number  of  cells.  Therefore  center  is  off  by  1/2  cell. 

b.  Constant  is  1/256  of  threshold. 

c.  Coding  may  require  double  length  arithmetic  to  accommodate  the  summation 
of  256  range  cells  of  6 bits  each. 


APPENDIX  C 


BENCHMARKS  - TIMING 


1 FFT,  Tracor/Litton 

2 FFT,  Raytheon 

3 Coordinate  Conversion  A,  Tracor/Litton 
Polar  to  Rectangular 

4 Coordinate Conversion A, Raytheon
Polar to Rectangular

5 Coordinate  Conversion  B,  Tracor/Litton 
Rectangular  to  Polar 

6 Coordinate  Conversion  B,  Raytheon 
Rectangular  to  Polar 

7 CFAR,  bit  packed,  Tracor/Litton 

8 CFAR,  non  bit  packed,  Raytheon 

9 CFAR,  non  bit  packed,  Tracor/Litton 

C1  FFT, TRACOR/LITTON TIMING

1.  Benchmark:  Fast Fourier Transform, 1024 Points

2.  Program


Cycles   Instructions executed

         R1    = M(CO)
         R5    = R1*M(BR)
         R2    = M(SI)
         R6    = R2*M(BI)
         R6    = R5-R6
         R5    = R2*M(BR)
         R7    = R1*M(BI)
         R5    = R5+R7
         R3    = M(AR)
         R4    = M(AI)
         M(AR) = R3+R6
         M(AI) = R4+R7
         M(BR) = R3-R6
         M(BI) = R4-R7


3.  Cycles calculation:

For 10 passes, 512 iterations per pass

Total = 10 passes × 512 iterations/pass × 30 cycles/iteration

      = 153,600 cycles - Tracor/Litton


4.  Remarks: 


a.  Total  cycles  excludes  set-up 

b.  Program executed as an In-Place FFT with single accumulator and multiplier.

c.  Complex  In-Place  FFT  will  reduce  cycles  by  a factor  of  2 to  4. 
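
The instruction sequence in the program is one radix-2 butterfly written out in real arithmetic. A minimal Python sketch of the same computation and of the cycle total is given below. The function names are hypothetical, and the 30 cycles per butterfly is taken from the cycles calculation rather than derived.

```python
def butterfly(ar, ai, br, bi, co, si):
    """One in-place radix-2 butterfly: returns the new A and new B values."""
    tr = co * br - si * bi   # real part of W*B
    ti = si * br + co * bi   # imaginary part of W*B
    return (ar + tr, ai + ti), (ar - tr, ai - ti)


def fft_cycles(points=1024, cycles_per_butterfly=30):
    """Cycle total excluding set-up: passes x butterflies per pass x cycles."""
    passes = points.bit_length() - 1   # log2(points) for a power of two
    butterflies_per_pass = points // 2
    return passes * butterflies_per_pass * cycles_per_butterfly
```

For 1024 points this gives 10 passes of 512 butterflies, which matches the total of 153,600 cycles.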

C2  FFT, RAYTHEON TIMING

1.  Benchmark:  Fast Fourier Transform, 1024 Points

2.  Program


1 MACRO  XR,  YR  = f(AR,  BR) 

3.  Cycles  calculation: 

2 cycles/clock,  whole  pipeline  is  tied  to  “external”  memory  access  where  other 

architectures  may  have  “internal”  memory  access  that  one  factor 

MACRO TIMING = 4 clocks (MACRO Clock) = 8 cycles (normalized)

MACROS = 512 MACROS/pass × 10 passes + 3 MACROS

       = 5,120 + 3 = 5,123 MACROS

Total  = 5,123 MACROS × 8 cycles/MACRO = 40,984 cycles - Raytheon


4.  Remarks:


Program executed as a Complex In-Place FFT which takes advantage of multiple
accumulations and multiplications.
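
The macro count and its comparison with the single-accumulator total can be reproduced with a short Python sketch. The function name is hypothetical, and treating the 3 extra MACROS as a pipeline flush is an assumption based on the flush counts quoted for the coordinate conversion timings.

```python
def raytheon_fft_cycles(points=1024, flush_macros=3, cycles_per_macro=8):
    """Normalized cycle total: one MACRO per butterfly plus a flush."""
    passes = points.bit_length() - 1          # log2(points)
    macros = (points // 2) * passes + flush_macros
    return macros * cycles_per_macro


TRACOR_LITTON_CYCLES = 10 * 512 * 30          # single accumulator, in place
RATIO = TRACOR_LITTON_CYCLES / raytheon_fft_cycles()
```

The product 5,123 × 8 evaluates to 40,984, and the resulting ratio of about 3.75 lies within the factor of 2 to 4 quoted for a complex in-place FFT.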

C3  COORDINATE  CONVERSION  A,  TRACOR/LITTON  TIMING 


1.  Benchmark:  Polar  to  Rectangular 

2.  Program:  Single Point R, θ to X, Y

Assume R > 0, θ in second quadrant


Cycles 


Instructions  executed 


1 

R0 

= 

0 

1 

R1 

= 

M(0) 

1 

R2 

= 

7T/2-R1 

1 

ALU 

= 

Rl-rr/2 

ALU ''O  NO  JUMP 

1 

R0 

= 

R0  + 1 

1 

R2 

= 

-Rl 

*> 

Rl 

= 

tt-M(6?) 

■> 

ALU 

— 

-Rl 

ALU < 0 JUMP  SIN 

1 1 

1 

SIN: 

RH 

- 

MP: 

Rl 

= 

Rl  * Rl 

“) 

R3 

= 

M(K4)  + Rl 

1 § 

MPYB 

= 

Rl 

R3 

= 

MPYB  * R3 

R3 

= 

R3  + M(K3) 

i # 

MPYB 

= 

Rl 

R3 

= 

MPYB  * R3 

*) 

R3 

= 

R3  + M(K2) 

i § 

MPYB 

= 

Rl 

2 

R3 

= 

MPYB  * R3 

i 

*> 

R3 

= 

R3  + M(K  1 ) 

1 

*>-> 

RL 

= 

RL  - 1 

IF  = 0 JUMP  SIGN 

i 

R4 

= 

R3 

Rl 

= 

R2 

JUMP  MP 

IT 

1 

SKIN: 

ALU 

R0 

NO  JUMP 

1 

R3 

= 

-R3 

R0 

= 

R0  - 1 

JUMP  POL 

*> 

POL: 

MPYB 

= 

M(R) 

*> 

M(K) 

= 

R3  * MPYB 

2 

M(Y) 

= 

R4  * MPYB 

To 

LN1) 

3.  Cycles  calculation: 

Total = 11 + 24 + 22 + 10 = 67 cycles - Tracor.

# Subtract 3 × 1 cycles for Litton; 64 cycles - Litton

4.  Remarks: 

a.  Array  processor  reduces  this  time  by  factor  of  2 to  4 

b.  This example shows average case time

C4  COORDINATE CONVERSION A, RAYTHEON TIMING

1.  Benchmark:  Polar  to  Rectangular 


2.  Program:  Single Point R, θ to X, Y

2 conversions per MACRO

3 MACROS to flush


3.  Cycles calculation:


Total = (4 MACROS / 2 conversions) × (8 cycles / MACRO) = 16 cycles/conversion - Raytheon


4.  Remarks:

For multiple conversions, flush of the pipe becomes less significant in calculation of
total cycles.
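
The remark about the flush can be made concrete with a small Python sketch. The function name is hypothetical; the figures of 2 conversions per MACRO, 3 MACROS to flush, and 8 cycles per MACRO are those quoted above.

```python
import math


def cycles_per_conversion(n, per_macro=2, flush_macros=3, cycles_per_macro=8):
    """Average normalized cycles per conversion for a batch of n conversions."""
    macros = math.ceil(n / per_macro) + flush_macros
    return macros * cycles_per_macro / n
```

For a batch of 2 conversions this reproduces the 16 cycles per conversion of the calculation, and for large batches the figure approaches 4 cycles per conversion as the 3 flush MACROS are amortized.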


C5  COORDINATE CONVERSION B, TRACOR/LITTON TIMING


1.  Benchmark:  Rectangular to Polar

2.  Program:  Single Point X, Y to R, θ

Assume X < 0, Y > 0 - second quadrant


Cycles 

2}  *3 
2 
■y 

IT 


1 

1 

2 

1 POS: 

2 

2 SQ1: 

1 SQ3: 

2 
1 


1 


Instructions  executed 


R0 

= 

M(K) 

R0 

= 

R0  * RU 

R 1 

= 

M(Y) 

R1 

= 

R 1 * R 1 

R1 

= 

R 1 + R0 

CALL  SO 

M(R) 

= 

R0 

HND 

M(T) 

= 

R2 

R0 

= 

0 

R2 

= 

R 1 

NO  JUMP 

ALU 

= 

R2 

JUMP  POS 

R2 

= 

0 

R0 

- 

R I - 0.5625 

JUMP  SQ1 

R0 

= 

R0  + 0.5 

JUMP  SQ3 

ALU 

= 

R0 

NO  JUMP 

M(U) 

= 

R2 

R2 

= 

-13 

M(V) 

= 

R1 

R0 

= 

0.56 

R 1 

= 

R0  * R0 

R0 

= 

R0  - R1 
Cycles


Instructions  executed 


2 2 

RO 

= 

M(V)  + RO 

2 1 

R2 

= 

R2  + 1 

JUMP  SQ5 

(12X7)  +6 

2 

R2 

= 

M(U) 

NO  JUMP 

1 

ALU 

= 

R2 

JUMP  SQOT 

3 

SQOT: 

R2 

= 

M(T) 

RETURN 

6 

Subtotal for R = 12 + 18 + (12 × 7) + 6 + 6

               = 12 + 18 + 84 + 6 + 6 = 126 cycles - Tracor

             # subtract 2             = 124 cycles - Litton

Cycles 

Instructions  executed 

2 

M(0) 

= 

0 

2 

R2 

= 

M(R) 

NO  JUMP 

1 

RO 

= 

0 

*> 

R1 

= 

M(Y) 

NO  JUMP 

2 

R3 

= 

R1 

JUMP  POS 

2 

POS 

CALL 

D1V 

1 

ALU 

= 

RO  - V 2/2 

NO  JUMP 

2 

CALL 

ARL 

SIN 

3 

QUAD 

R1 

= 

M(*) 

JUMP  Q2 

2 

Q2 

R2 

= 

M(Y) 

NO  JUMP 

3 

M(0) 

= 

v - RO 

JUMP  END 

END 

I 

D1V 

RO 

= 

R 1 

NO  JUMP 

ALU 

= 

R2 

JUMP  D2 

I 

D2 

RO 

= 

R1‘  + 1 

1 

r: 

= 

R2‘  + 1 

NO  JUMP 

*> 

ALU 

= 

R2 

JUMP  D3 

2 

D3 

M(S) 

= 

R3 

1 

R3 

= 

1 

2 

D4 

ALU 

= 

R2  + 0.49dd 

JUMP  D5 

2 

D5 

M(Y) 

= 

R4 

"> 

M(Y) 

= 

R5 

1 

R4 

= 

R2  SHR  1 

1 

R4 

= 

R4  + (-MAX) 

1 # 

MPYB 

= 

R1 

*> 

R4 

= 

R4  * MPYB 

1 

R5 

= 

R2  + (MAX) 

R5 

= 

R5  * R5 

1 

RO 

= 

R4 

M(N) 

= 

R6 

1 

R6 

— 

7 

28  ! # 

D6 

MPYB 

= 

R5 

*> 

RO 

= 

RO  * MPYB 

•3 
Instructions  executed


1 

R0 

= 

R0  + R4 

1 

R6 

= 

R6  - 1 

JUMP  D6 

5X7  =35 

R0 

= 

R0  SML  L 

a. 

R3 

- 

R3  - '1 

JUMP  D8 

2 D8 

R3 

= 

M(S) 

2 

R4 

= 

M(*) 

2 

R5 

= 

M(Y) 

3 

R6 

= 

M(N) 

RLTURN 

1 1 

1 

ARC  SIN: 

R0 

= 

R0  * R0 

2 

R1 

= 

M(K4)  + R0 

i 

f 

MPYB 

= 

R0 

2 

R 1 

= 

MPYB  * R1 

■> 

R1 

= 

R1  + M(K3) 

1 

# 

MPYB 

= 

R0 

R 1 

= 

MPYB  * R3 

R 1 

= 

R 1 + M(K2) 

1 

if 

Mi‘\  B 

= 

R0 

3 

R 1 

= 

R 1 + M(K  1 ) 

RLTURN 

77 

Subtotal for θ = 22 + 28 + (5 × 7) + 11 + 17

               = 22 + 28 + 35 + 11 + 17 = 113 cycles - Tracor

             # = 22 + 27 + 28 + 11 + 14 = 102 cycles - Litton

4.  Cycles calculation:

Total = Subtotal R + Subtotal θ = 126 + 113 = 239 cycles - Tracor

                                = 124 + 102 = 226 cycles - Litton

5.  Remarks

a.  Time varies depending on the quadrant of X and Y.
b.  This example shows average case time
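
The subtotals and totals above can be re-added with a few lines of Python. The constant names are hypothetical; the individual terms are those of the subtotal lines.

```python
# Terms of the R subtotal (Tracor); Litton saves 2 cycles on this part.
SUBTOTAL_R_TRACOR = 12 + 18 + 12 * 7 + 6 + 6
SUBTOTAL_R_LITTON = SUBTOTAL_R_TRACOR - 2

# Terms of the theta subtotal for each machine.
SUBTOTAL_THETA_TRACOR = 22 + 28 + 5 * 7 + 11 + 17
SUBTOTAL_THETA_LITTON = 22 + 27 + 28 + 11 + 14

TOTAL_TRACOR = SUBTOTAL_R_TRACOR + SUBTOTAL_THETA_TRACOR
TOTAL_LITTON = SUBTOTAL_R_LITTON + SUBTOTAL_THETA_LITTON
```

The two iterative loops, the 13-pass square root and the 7-pass divide, account for 84 + 35 = 119 of the 239 Tracor cycles.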

C6  COORDINATE CONVERSION B, RAYTHEON TIMING


1.  Benchmark:  Rectangular to Polar

2.  Program:  Single Point X, Y to R, θ

2 conversions per MACRO

3 MACROS to flush


3.  Cycles calculation:

Total = (4 MACROS / 2 conversions) × (8 cycles / MACRO) = 16 cycles/conversion - Raytheon


4.  Remarks:

For multiple conversions, flush of the pipe becomes less significant in calculation
of total cycles.

C7  CFAR, BIT PACKED, TRACOR/LITTON TIMING

1.  Benchmark:  Sliding window CFAR

2.  Program:  Assume 4,096 range cells

256 cell window

Cycles 

Instructions 

executed 

1 

R0 

0 

1 

CF1: 

+ 1 

256 

3 

R0 

M (10  + 11)  11 

= 11-1  IF  NOT 

4 x 

"256 

JUMP  CF 

1 

11 

0 

1 

12 

257 

1 

13 

0 

1 

14 

128 

1 

15 

0 

T 

1 2 2 

CF2: 

15  * 

0 

NO  JUMP/CF3 

1 1 

13 

13  + 1 

1 2 

ALU 

257  - 13 

NO  JUMP/LND 

1 5 

15 

16 

1 

Rl 

0 

1 

R2 

1 

3 3 

CF3: 

R3 

l R0  * M(K) 

3 3 

R3 

R3  - M (10  + 

14),  14  = 14  + 1, 

JUMP  CF 

1 1 

CF4: 

R2 

SML  1 15  = 15 

- 1 

2 2 

t- 

R0 

R0  + M 

3 3 

R0 

R0  - M 

JUMP  CF2 

18  14 

X 15 

= 210 
228  X 256 

58,368 
3.  Cycles calculation:


Total  =  1

          4 × 256 = 1,024

          5

          58,368

          59,403 cycles - Tracor/Litton

4.  Remarks: 

a.  Algorithm  includes  packing  of  every  16  threshold  decisions  into  one  word 

b.  Processing  is  in  real-time,  single  pass. 


C8  CFAR, NON BIT PACKED, RAYTHEON TIMING


1.  Benchmark:  Sliding window CFAR

2.  Program:  Assume 4,096 range cells

256 cell window


MACRO      64 times
MACRO      2,048 times
FLUSH      3
MACRO      2,048 times
FLUSH      4

Total      4,166


3.  Cycles  calculation 


Total = 4,166 MACROS × 8 cycles/MACRO = 33,328 cycles - Raytheon


4.  Remarks: 

a.  Each threshold decision is in a separate word.

b.  Two decisions per MACRO.

c.  Processing is not in real-time.  A complete sweep of all range cells is
required before processing begins.

d.  Interim result requires 4096 words of temporary storage.

e.  Accumulation of 256 cells (window), each of a 6-bit unsigned quantity, into a
12 bit accumulator may be a problem.

f.  Bit packed CFAR would require more MACROS and would be significantly slower.
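
The concern in remark e can be quantified with a short Python sketch. The function name is hypothetical; the cell count and cell width are those given in the remark.

```python
def accumulator_bits(cells, bits_per_cell):
    """Bits needed to hold the worst-case sum of unsigned cells."""
    worst_case = cells * ((1 << bits_per_cell) - 1)
    return worst_case.bit_length()


WORST_CASE_SUM = 256 * 63        # 256 cells, each at the 6-bit maximum
TWELVE_BIT_LIMIT = (1 << 12) - 1
```

The worst-case sum of 16,128 needs 14 bits and exceeds the 12-bit limit of 4,095, which is why double length arithmetic may be required.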
C9  CFAR, NON BIT PACKED, TRACOR/LITTON TIMING

1.  Benchmark:  Sliding Window CFAR

2.  Program:  Assume 4,096 range cells

256 cell window


Window 


11 


14 


13 


Cycles 


Instructions  executed 


Range 


INDEX 


-*  ■ -jo 

I 

1 

1 

1 

■> 

T 1 2 

3 

T 

2 

9 x 4096 


CEO:  RO 

CE1:  II 

RO 

11 

12 

13 

14 
RF 

CF2:  ALU 

M(I3),  R3 

RO 

RO 

END 


0 

255 

M (10  * II)  11=11-1  IF  i JUMP  CF1 
0 

255 

0 

127 

M(K) 

257  - 13  IFO  JUMP  END 

RO  * RF  13  = 13  + 1 

RO  + M (JO  + 12)  12  = 12+1 

RO  - M (10  + II)  II  = l|  + i JUMP  CF2 


3.  Cycles  calculation: 


Total  I 

4 X 256  = 1,024 

9 x 4,096  = 36,864 


37,897  cycles  - Tracor/Litton 


4.  Remarks 


a.  Algorithm  uses  one  word  for  each  threshold  decision 

b.  Processing  is  in  real-time,  single  pass. 